PREFACE


In the real world, systems designed to extract signals from noisy measurements are
plagued by errors ranging from constraints of the sensors employed, to random
disturbances and noise and, probably most common, the lack of precise knowledge
of the underlying physical phenomenology generating the process in the first place.
Methods capable of extracting the desired signal from hostile environments require 
approaches that capture all of the a priori information available and incorporate them 
into a processing scheme. This approach is typically model-based [1] employing 
mathematical representations of the component processes involved. However, the 
actual implementation providing the algorithm evolves from the realm of statistical 
signal processing using a Bayesian approach based on Bayes’ rule. Statistical signal 
processing is focused on the development of processors capable of extracting the 
desired information from noisy, uncertain measurement data. This is a text that
develops the “Bayesian approach” to statistical signal processing for a variety of useful
model sets. It features the next generation of processors which have recently been 
enabled with the advent of high speed/high throughput computers. The emphasis is 
on nonlinear/non-Gaussian problems, but classical techniques are included as special 
cases to enable the reader familiar with such methods to draw a parallel between the 
approaches. The common ground is the model sets. Here the state-space approach 
is emphasized because of its inherent applicability to a wide variety of problems 
both linear and nonlinear as well as time invariant and time-varying problems
including what has become popularly termed “physics-based” models. This text brings
the reader from the classical methods of model-based signal processing including 
Kalman filtering for linear, linearized and approximate nonlinear processors as well 
as the recently developed unscented or sigma-point filters to the next generation of 
processors that will clearly dominate the future of model-based signal processing for 
years to come. It presents a unique viewpoint of signal processing from the Bayesian 
perspective in contrast to the pure statistical approach found in many textbooks. 
Although designed primarily as a graduate textbook, it will prove very useful to the 
practicing signal processing professional or scientist, since a wide variety of
applications are included to demonstrate the applicability of the Bayesian approach to
real-world problems. The prerequisites for such a text are a melding of undergraduate
work in linear algebra, random processes, linear systems, and digital signal
processing as well as a minimal background in model-based signal processing as illustrated
in the recent text [1]. It is unique in the sense that few texts cover the breadth of its
topics, while the underlying theme of this text is the Bayesian approach that is
uniformly developed and followed throughout in the algorithms, examples, applications
and case studies. It is this theme coupled with the hierarchy of physics-based models
developed that contributes to its uniqueness. This text has evolved from three
previous texts, Candy [1-3], coupled with a wealth of practical applications to real-world
Bayesian problems.

The Bayesian approach has existed in statistical physics for a long time and can 
be traced back to the 1940s with the evolution of the Manhattan project and the 
work of such prominent scientists as Ulam, von Neumann, Metropolis, Fermi,
Feynman, and Teller. Here the idea of Monte Carlo (MC) techniques to solve complex
integrals evolved [4]. Since its birth, Monte Carlo related methods have been the
mainstay of many complex statistical computations. Many applications have evolved 
from this method in such areas as physics, biology, chemistry, computer science, 
economics/finance, material science, statistics and more recently in engineering. 
Thus, statisticians have known for a long time about these methods, but their
practicalities have not really evolved as a working tool until the advent of high speed
supercomputers around the 1980s. In signal processing it is hard to pinpoint the
actual initial starting point but clearly the work of Handschin and Mayne in the late 
1960s and early 1970s [5, 6] was the initial evolution of Monte Carlo techniques for 
signal processing and control. However, from the real-time perspective, it is probably
the development of the sequential Bayesian processor, made practical by the work of
Gordon, Salmond and Smith in 1993 [7], that enabled the evolution and explosion of
the Bayesian sequential processors currently being researched. To put this
text in perspective we must discuss the current signal processing texts available on 
Bayesian processing. Since its evolution much has been published in the statistical
literature on Bayesian techniques for statistical estimation; however, the earliest texts are
probably those of Harvey [8], Kitagawa and Gersch [9] and West [10], which
emphasize the Bayesian model-based approach incorporating dynamic linear or nonlinear
models into the processing scheme for additive Gaussian noise sources leading to the 
classical approximate (Kalman) filtering solutions. These works extend those results 
to non-Gaussian problems using Monte Carlo techniques for eventual solution laying 
the foundation for works to follow. Statistical MC techniques were also available, but 
not as accessible to the signal processor due to statistical jargon and abstractness of 
the discussions. Many of these texts have evolved during the 1990s such as Gilks [11], 
Robert [12], Tanner [13], Tanizaki [14], with the more up-to-date expositions evolving 
in the late 1990s and currently such as Liu [4], Ruanaidh [15], Haykin [16], Doucet 
[17], Ristic [18] and Cappe [19]. Also during the last period a sequence of tutorials
and special IEEE issues evolved exposing the MC methods to the signal processing
community such as Godsill [20], Arulampalam [21], Djuric [22], Haykin [23],
Doucet [24] and Candy [25], as well as a wealth of signal processing papers (see
references for details). Perhaps the most complete textbook from the statistical researcher’s
perspective is that of Cappe [19]. In this text much of the statistical MC sampling
theory is developed along with all of the detailed mathematics—ideal for an evolving 
researcher. But what about the entry level person: the engineer, the
experimentalist, and the practitioner? This is what is lacking in all of this literature. Questions
like, how do the MC methods relate to the usual approximate Kalman methods? How
does one incorporate models (model-based methods) into a Bayesian processor? How
does one judge performance compared with classical methods? These are all
basically pragmatic questions that the proposed text will answer in a lucid manner through
coupling the theory to real-world examples and applications. Thus, the goal of this 
text is to provide a bridge for the practitioners with enough theory and applications to 
provide the basic background to comprehend the Bayesian framework and enable the 
application of these powerful techniques to real-world problem solving. Next, let us 
discuss the structure of the proposed text in more detail to understand its composition 
and approach. 

We first introduce the basic ideas and motivate the need for such processing while 
showing that they clearly represent the next generation of processors. We discuss 
potential application areas and motivate the requirement for such a generalization. 
That is, we discuss how the simulation-based approach to Bayesian processor design
provides a much needed capability that, while well known in the statistical community,
was not very well known (until recently) in the signal processing community. After
introducing the basic concepts in Chapter 1, we begin with the basic Bayesian processors
in Chapter 2. We start with the Bayesian “batch” processor and establish its
construction by developing the fundamental mathematics required. Next we discuss the
well-known maximum likelihood (ML) and minimum (error) variance (MV) or
equivalently minimum mean-squared error (MMSE) processors. We illustrate the similarities
and differences between the schemes. Next we launch into sequential Bayesian
processing schemes, which form the foundation of the text. By examining the “full”
posterior distribution in both dynamic variables of interest as well as the full data 
set, we are able to construct the sequential Bayesian approach and focus on the usual 
filtered or filtering distribution case of highest interest demonstrating the
fundamental prediction/update recursions inherent in the sequential Bayesian structure. Once
the general Bayesian sequential processor (BSP) is established, the schemes that follow
are detailed depending on the assumed distribution with a variety of model sets. 

We briefly review simulation-based methods starting with sampling methods,
progressing to Monte Carlo approaches leading to the basic iterative methods of sampling
using the Metropolis, Metropolis-Hastings, Gibbs and slice samplers. Since one of
the major motivations of recursive or sequential Bayesian processing is to provide a
real-time or pseudo real-time processor, we investigate the idea of importance
sampling as well as sequential importance sampling techniques leading to the generic
Bayesian sequential importance sampling algorithm. Here we show the solution can 
be applied, once the importance sampling distribution is defined. 

In order to be useful, Bayesian processing techniques must be specified through 
a set of models that represent the underlying phenomenology driving the particular 
application. For example, in radar processing we must investigate the propagation
models, tracking models, geometric models, and so forth. In Chapter 4, we develop the
state-space approach to signal modeling which forms the basis of many applications
such as speech, radar, sonar, acoustics, geophysics, communications, control, etc.
Here we investigate continuous, sampled-data and discrete state-space signals and
systems. We also discuss the underlying systems theory and extend the model-set to
include the stochastic case with noise driving both process and measurements leading
to the well-known Gauss-Markov (GM) representation which forms the starting point
for the classical Bayesian processors to follow. We also discuss the equivalence of the
state-space model to a variety of time series (ARMA, AR, MA, etc.) representations
as well as the common engineering model sets (transfer functions, all-pole, all-zero, 
pole-zero, etc.). This discussion clearly demonstrates why the state-space model 
with its inherent generality is capable of capturing the essence of a broad variety of 
signal processing representations. Finally, we extend these ideas to nonlinear
state-space models leading to “approximate” Gauss-Markov representations evolving from
nonlinear, perturbed and linearized systems.

In the next chapter, we develop classical Bayesian processors by first motivating 
the Bayesian approach to the state-space where the required conditional distributions 
use the embedded state-space representation. Starting with the linear, time-varying, 
state-space models, we show that the “optimum” classical Bayesian processor under 
multivariate Gaussian assumptions leads to minimum (error) variance (MV) or
equivalently minimum mean-squared error (MMSE), which is the much heralded Kalman
filter of control theory [1]. That is, simply substituting the underlying Gauss-Markov
model into the required conditional distributions leads directly to the BSP or Kalman
filter in this case. These results are then extended to nonlinear state-space
representations, which are linearized using a known reference trajectory through perturbation
theory and Taylor-series expansions. Starting with the linearized or approximate GM 
model of Chapter 4, we again calculate the required Bayesian sequential processor 
from the conditionals which lead to the “linearized” BSP (or linearized Kalman filter) 
algorithm. Once this processor is developed, it is shown that the “extended” Bayesian 
processor follows directly by linearizing about the most currently available estimate 
rather than the reference trajectory. The extended Bayesian processor (XBP) or
equivalently extended Kalman filter (EKF) of nonlinear processing theory evolves quite
naturally from the Bayesian perspective, again following the usual development by
defining the required conditionals, making nonlinear approximations and
developing the posterior distributions under multivariate Gaussian assumptions. Next, we
briefly investigate an iterative version of the XBP processor, again from the Bayesian
perspective, which leads directly to the iterative version of the extended Bayesian
processor (IX-BP) algorithm, an effective tool when nonlinear measurements dominate
the uncertain measurements required.

Chapter 6 focuses on statistical linearization methods leading to the modern
unscented Bayesian processor (UBP) or equivalently sigma-point Bayesian
processor (SPBP). Here we show how statistical linearization techniques can be used to
transform the underlying probability distribution using the sigma-point or unscented
nonlinear transformation technique (linear regression) leading to the unscented
Bayesian processor or equivalently the unscented Kalman filter (UKF). Besides
developing the fundamental theory and algorithm, we demonstrate its performance on a
variety of example problems. We also briefly discuss the Gauss-Hermite quadrature
(G-H) and Gaussian sum (G-S) techniques for completeness.

We reach the heart of the particle filtering methods in Chapter 7, where we discuss 
the Bayesian approach to the state-space. Here the ideas of Bayesian and model- 
based processors are combined through the development of Bayesian state-space 
particle filters. Initially, it is shown how the state-space models of Chapter 4 are 
incorporated into the conditional probability distributions required to construct the 
sequential Bayesian processors through importance sampling constructs. After
investigating a variety of importance proposal distributions, the basic set of state-space
particle filters (SSPF) is developed and illustrated through a set of example
problems and simulations. The techniques including the bootstrap, auxiliary, regularized
MCMC and linearized particle filters are developed and investigated when applied to 
the set of example problems used to evaluate algorithm performance. 

In Chapter 8 the important joint Bayesian SSPF are investigated by first developing
the joint filter popularly known as the parametrically adaptive processor [1]. Here both
states and static as well as dynamic parameters are developed as solutions to this joint
estimation problem. The performance of these processors is compared to classical
and modern processors through example problems. 

In Chapter 9 the hidden Markov models (HMM) are developed for event-related
problems (e.g., Poisson point processes). This chapter is important in order to place
purely discrete processes into perspective. HMM evolve for any type of memoryless
counting process and become important in financial applications, communications,
biometrics, as well as radiation detection. Here we briefly develop the fundamental
ideas and discuss them in sufficient depth to develop a set of techniques used by the practitioner
while applying them to engineering problems of interest.

In the final chapter, we investigate a set of physics-based applications focusing on 
the Bayesian approach to solving real-world problems. By progressing through a step- 
by-step development of the processors, we see explicitly how to develop and analyze 
the performance of such Bayesian processors. We start with a practical laser alignment 
problem followed by a broadband estimation problem in ocean acoustics. Next the 
solid-state microelectromechanical (MEM) sensor problem for biothreat detection is
investigated followed by a discrete radiation detection problem based on counting
statistics. All of these methods invoke Bayesian techniques to solve the particular
problems of interest, giving the practitioner the opportunity to track “real-world”
Bayesian model-based solutions.

The place of such a text in the signal processing textbook community can best 
be explained by tracing the technical ingredients that comprise its contents. It can 
be argued that it evolves from the digital signal processing area primarily from those 
texts that deal with random or statistical signal processing or possibly more succinctly 
“signals contaminated with noise.” The texts by Kay [26-28], Therrien [29], Brown 
[30] all provide the basic background information in much more detail than this text, 
so there is little overlap at the detailed level with them.
This text, however, possesses enough theory for the graduate or advanced graduate
student to develop a fundamental basis to go on to more rigorous texts like Jazwinski
[31], Sage [32], Gelb [33], Anderson [34], Maybeck [35], Bozic [36], Kailath [37,
38], and more recently, Mendel [39], Grewal [40], Bar-Shalom [41] and Simon [42].
These texts are rigorous and tend to focus on Kalman filtering techniques ranging
from continuous to discrete with a wealth of detail on all of their variations. The 
Bayesian approach discussed in this text certainly includes the state-space models 
as one of its model classes (probably the most versatile), but the emphasis is on 
various classes of models and how they may be used to solve a wide variety of signal 
processing problems. Some of the more recent texts at about the same technical level,
but again with a different focus, are Widrow [43], Orfanidis [44], Scharf [45], Haykin
[46], Hayes [47], Brown [30] and Stoica [48]. Again the focus of these texts is not the
Bayesian approach but more on a narrow set of specific models and the development
of a variety of algorithms to estimate these sets. The system identification literature 
and texts therein also provide some overlap with this text, but again the approach is 
focused on estimating a model from noisy data sets and not really aimed at developing 
a Bayesian solution to a particular signal processing problem. The texts in this area 
are Ljung [49, 50], Goodwin [51], Norton [52] and Soderstrom [53].

The recent particle filtering texts of Ristic [18] and Cappe [19] are useful as
references to accompany this text, especially if more details are required on the tracking
problem and the fundamental theorems governing statistical properties and
convergence proofs. That is, Ristic’s text provides an introduction that closely follows the
2002 tutorial paper by Arulampalam [21] but provides little of the foundational
material necessary to comprehend this approach. It focuses primarily on the tracking
problem. Cappe’s text is at a much more detailed technical level and is written for
researchers in this area, not specifically aimed at the practitioner’s viewpoint. The
proposed text combines the foundational material and some theory with the practice
and application of particle filters (PF) to real-world applications and examples.

The approach we take is to introduce the basic idea of Bayesian signal processing 
and show where it fits in terms of signal processing. It is argued that BSP is a natural 
way to solve basic processing problems. The more a priori information we know
about data and its evolution, the more information we can incorporate into the
processor in the form of mathematical models to improve its overall performance. This
is the theme and structure that echoes throughout the text. Current applications (e.g., 
structures, tracking, equalization, biomedical) and simple examples are incorporated 
to motivate the signal processor. Examples are discussed to motivate all of the models 
and prepare the reader for further developments in subsequent chapters. In each case
the processor along with accompanying simulations is discussed and applied to
various data sets demonstrating the applicability and power of the Bayesian approach.
The proposed text is linked to the MATLAB software package (a signal processing
standard), with notes provided at the end of each chapter.

In summary, this Bayesian signal processing text will provide a much needed 
“down-to-earth” exposition of modern MC techniques. It is coupled with well-known 
signal processing model sets along with examples and problems that can be used 
to solve many real-world problems by practicing engineers and scientists along
with entry level graduate students as well as advanced undergraduates and
postdoctorates requiring a solid introduction to the “next generation” of model-based
signal processing techniques. 


James V. Candy 

Danville, California 
January 2009


INTRODUCTION


1.1 INTRODUCTION 

In this chapter we motivate the philosophy of Bayesian processing from a probabilistic 
perspective. We show the coupling between model-based signal processing (MBSP)
incorporating the a priori knowledge of the underlying processes and the Bayesian
framework for specifying the distribution required to develop the processors. The idea
of the sampling approach evolving from Monte Carlo (MC) and Markov chain Monte
Carlo (MCMC) methods is introduced as a powerful methodology for simulating the
behavior of complex dynamic processes and extracting the embedded information 
required. The main idea is to present the proper perspective for the subsequent chapters 
and construct a solid foundation for solving signal processing problems. 


1.2 BAYESIAN SIGNAL PROCESSING 

The development of Bayesian signal processing has evolved in a manner proportional 
to the evolution of high performance/high throughput computers. This evolution has 
led from theoretically appealing methods to pragmatic implementations capable of 
providing reasonable solutions for nonlinear and highly multi-modal (multiple
distribution peaks) problems. In order to fully comprehend the Bayesian perspective,
especially for signal processing applications, we must be able to separate our thinking 
and in a sense think more abstractly about probability distributions without worrying 
about how these representations can be “applied” to realistic processing problems. Our 
motivation is to first present the Bayesian approach from a statistical viewpoint and 
then couple it to useful signal processing implementations following the well-known 
model-based approach [1, 2]. Here we show that when we constrain the Bayesian
distributions in estimation to Markovian representations using primarily state-space
models, we can construct sequential processors capable of “pseudo real-time”
operations that can easily be utilized in many physical applications. Bayes’ rule provides
the foundation of all Bayesian estimation techniques. We show how it can be used to 
both theoretically develop processing techniques based on a specific distribution (e.g., 
Poisson, Gaussian, etc.) and then investigate properties of such processors relative to 
some of the most well-known approaches discussed throughout texts in the field. 

Bayesian signal processing is concerned with the estimation of the underlying 
probability distribution of a random signal in order to perform statistical inferences [3]. 
These inferences enable the extraction of the signal from noisy uncertain measurement 
data. For instance, consider the problem of extracting the random variate, say X, from 
the noisy data, Y. The Bayesian approach is to first estimate the underlying conditional 
probability distribution, Pr(X | Y), and then perform the associated inferences to extract 
X, that is, 

$$\Pr(X|Y) \Rightarrow \hat{X} = \arg\max_{X} \Pr(X|Y)$$

where the caret, $\hat{X}$, denotes an estimate of X. This concept of estimating the
underlying distribution and using it to extract a signal estimate provides the foundation of
Bayesian signal processing developed in this text. 

Let us investigate this idea in more detail. We start with the previous problem of 
trying to estimate the random parameter, X, from noisy data Y = y. Then the
associated conditional distribution Pr(X | Y = y) is called the posterior distribution because
the estimate is conditioned “after (post) the measurements” have been acquired.
Estimators based on this a posteriori distribution are usually called Bayesian because
they are constructed from Bayes’ rule, since Pr(X | Y) is difficult to obtain directly.
That is, 


$$\Pr(X|Y) = \frac{\Pr(Y|X) \times \Pr(X)}{\Pr(Y)} \tag{1.1}$$


where Pr(X) is called the prior distribution (before measurement), Pr(Y | X) is called
the likelihood (more likely to be true) and Pr(Y) is called the evidence (scales the
posterior to assure its integral is unity). Bayesian methods view the sought after 
parameter as random possessing a “known” a priori distribution. As measurements 
are made, the prior is transformed to the posterior distribution function adjusting the 
parameter estimates. Thus, the result of increasing the number of measurements is 
to improve the a posteriori distribution resulting in a sharper peak closer to the true 
parameter as shown in Fig. 1.1. 
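
The sharpening of the posterior with additional measurements can be illustrated numerically. The following is a minimal sketch, not the implementation used in this text: it assumes a scalar parameter X with a Gaussian prior, independent Gaussian measurement noise, and a brute-force evaluation of Bayes' rule (1.1) on a uniform grid. The function names, the grid and all numerical values are hypothetical choices made only for illustration.

```python
import math
import random


def log_gaussian(x, mean, std):
    """Log of the Gaussian density with the given mean and standard deviation."""
    return -0.5 * ((x - mean) / std) ** 2 - math.log(std * math.sqrt(2.0 * math.pi))


def grid_posterior(measurements, grid, prior_mean=0.0, prior_std=2.0, noise_std=1.0):
    """Evaluate Bayes' rule (1.1) on a uniform grid of candidate values of X."""
    # log prior + log likelihood of all (assumed independent) measurements
    log_post = [
        log_gaussian(x, prior_mean, prior_std)
        + sum(log_gaussian(y, x, noise_std) for y in measurements)
        for x in grid
    ]
    peak = max(log_post)  # subtract the peak before exponentiating to avoid underflow
    unnormalized = [math.exp(lp - peak) for lp in log_post]
    dx = grid[1] - grid[0]
    evidence = sum(unnormalized) * dx  # role of Pr(Y), up to the factor exp(peak)
    return [u / evidence for u in unnormalized]


def map_estimate(posterior, grid):
    """Grid value at which the posterior is largest (the argmax inference)."""
    return grid[max(range(len(grid)), key=posterior.__getitem__)]


def posterior_std(posterior, grid):
    """Standard deviation of the gridded posterior, a measure of its sharpness."""
    dx = grid[1] - grid[0]
    mean = sum(x * p for x, p in zip(grid, posterior)) * dx
    var = sum((x - mean) ** 2 * p for x, p in zip(grid, posterior)) * dx
    return math.sqrt(var)


rng = random.Random(0)
true_x = 1.0  # assumed "true" parameter used to synthesize the data
grid = [-5.0 + 0.01 * i for i in range(1001)]
data = [true_x + rng.gauss(0.0, 1.0) for _ in range(100)]
for n in (1, 10, 100):
    post = grid_posterior(data[:n], grid)
    print(n, map_estimate(post, grid), posterior_std(post, grid))
```

Under these assumptions the printed posterior spread shrinks as the number of measurements grows, which is the behavior sketched in Fig. 1.1. A grid is used here only because the example is one-dimensional; it is exactly the approach that becomes impractical in higher dimensions.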

When the variates of interest are dynamic, then they are functions of time and
therefore, $X_t \rightarrow X$ and $Y_t \rightarrow Y$. Bayes’ rule for the joint dynamic distribution is

$$\Pr(X_t|Y_t) = \frac{\Pr(Y_t|X_t) \times \Pr(X_t)}{\Pr(Y_t)} \tag{1.2}$$


In Bayesian theory, the posterior defined by $\Pr(X_t|Y_t)$ is decomposed in terms
of the prior $\Pr(X_t)$, its likelihood $\Pr(Y_t|X_t)$ and the evidence or normalizing factor,
$\Pr(Y_t)$. Bayesian signal processing in this dynamic case follows the identical path,
that is,

$$\Pr(X_t|Y_t) \Rightarrow \hat{X}_t = \arg\max_{X_t} \Pr(X_t|Y_t)$$

FIGURE 1.1 Bayesian estimation of the random variate X transforming the prior, Pr(X) to
the posterior, Pr(X | Y) using Bayes' rule.

So we begin to see the versatility of the Bayesian approach to random signal 
processing. Once the posterior distribution is determined, then all statistical inferences 
or estimates are made. For instance, suppose we would like to obtain the prediction 
distribution. Then it can be obtained as 

$$\Pr(X_{t+1}|Y_t) = \int \Pr(X_{t+1}|X_t, Y_t) \times \Pr(X_t|Y_t)\, dX_t$$

and a point estimate might be the conditional mean of this distribution, that is,

$$E\{X_{t+1}|Y_t\} = \int X_{t+1} \Pr(X_{t+1}|Y_t)\, dX_{t+1}$$

This relation shows how information that can be estimated from the extracted 
distribution is applied in the estimation context by performing statistical inferences. 
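
A small numerical illustration of the prediction step may help. The sketch below makes two assumptions that are not part of the general relations above: the state takes only a finite number of values, so the integrals become sums, and the state is Markovian, so that $\Pr(X_{t+1}|X_t, Y_t) = \Pr(X_{t+1}|X_t)$. The transition matrix, the filtered distribution and the function names are hypothetical values chosen for illustration.

```python
def predict(transition, filtered):
    """Prediction distribution for a finite-state Markov model.

    Pr(X_{t+1} = j | Y_t) = sum_i Pr(X_{t+1} = j | X_t = i) * Pr(X_t = i | Y_t)
    where transition[i][j] holds Pr(X_{t+1} = j | X_t = i).
    """
    n = len(filtered)
    return [sum(transition[i][j] * filtered[i] for i in range(n)) for j in range(n)]


def conditional_mean(values, distribution):
    """Point estimate: the mean of the state values under the given distribution."""
    return sum(v * p for v, p in zip(values, distribution))


# Hypothetical three-state example
states = [0.0, 1.0, 2.0]
transition = [
    [0.8, 0.2, 0.0],
    [0.1, 0.8, 0.1],
    [0.0, 0.2, 0.8],
]
filtered = [0.1, 0.6, 0.3]  # assumed posterior Pr(X_t | Y_t)

predicted = predict(transition, filtered)
print(predicted, conditional_mean(states, predicted))
```

The same two-step pattern, first propagate the posterior through the transition model and then take an expectation, carries over to the continuous case, where the sums are replaced by the integrals given above.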

Again, even though the Bayesian signal processing concept is conceptually simple,
the real problem to be addressed is that of evaluating the integrals, which is very
difficult because they are only analytically tractable for a small class of priors and
likelihood distributions. The large dimensionality of the integrals causes numerical
integration techniques to break down, which leads to the approximations we discuss
subsequently for stabilization. Next let us consider the various approaches taken
to solve the probability distribution estimation problems using non-parametric or 
parametric representations. This will eventually lead to the model-based approach [4].


1.3 SIMULATION-BASED APPROACH TO BAYESIAN PROCESSING

The simulation-based approach to Bayesian processing is founded on Monte Carlo 
(MC) methods that are stochastic computational techniques capable of efficiently 
simulating highly complex systems. Historically motivated by games of chance and 
encouraged by the development of the first electronic computer (ENIAC), the MC 
approach was conceived by Ulam (1945), developed by Ulam, Metropolis and von 
Neumann (1947) and coined by Metropolis (1949) [5-9]. The method evolved in
the mid-1940s during the Manhattan project by scientists investigating calculations 
for atomic weapon designs [10]. It evolved further from such areas as computational 
physics, biology, chemistry, mathematics, engineering, materials and finance to name 
a few. Monte Carlo methods offer an alternative approach to solving classical
numerical integration and optimization problems. Inherently, as the dimensionality of the
problem increases classical methods are prone to failure while MC methods tend to
increase their efficiency by reducing the error, an extremely attractive property. For
example, in the case of classical grid-based numerical integration or optimization
problems as the number of grid points increases along with the number of problem
defining vector components, there is an accompanying exponential increase in
computational time [10-15]. The stochastic MC approach of selecting random samples
and averaging over a large number of points actually reduces the computational error 
by the Law of Large Numbers irrespective of the problem dimensionality. It utilizes 
Markov chain theory as its underlying foundation establishing the concept that through 
random sampling the resulting “empirical” distribution converges to the desired
posterior called the stationary or invariant distribution of the chain. Markov chain Monte
Carlo (MCMC) techniques are based on sampling from probability distributions using
a Markov chain, which is a stochastic system governed by a transition probability,
having the desired posterior distribution as its invariant distribution. Under certain
assumptions the chain converges to the desired posterior through proper random
sampling as the number of samples becomes large, a crucial property (see [10] for
details). Thus, the Monte Carlo approach has evolved over a long time period and 
is well understood by scientists and statisticians, but it must evolve even further to 
be useful for signal processors to become an effective tool in their problem solving 
repertoire. 
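
To make the Markov chain idea concrete, the following is a minimal sketch of a random-walk Metropolis sampler, one of the simplest MCMC schemes. It is an illustration under stated assumptions rather than the algorithm developed later in the text: the target is taken to be a scalar Gaussian known only up to its normalizing constant, and the step size, chain length, burn-in length and function names are hypothetical choices.

```python
import math
import random


def metropolis(log_target, x0, n_samples, step=1.0, seed=0):
    """Random-walk Metropolis sampler for an unnormalized scalar log density."""
    rng = random.Random(seed)
    x, log_px = x0, log_target(x0)
    chain = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal centered on the current state
        candidate = x + rng.gauss(0.0, step)
        log_pc = log_target(candidate)
        # Accept with probability min(1, target(candidate) / target(current))
        accept_prob = math.exp(min(0.0, log_pc - log_px))
        if rng.random() < accept_prob:
            x, log_px = candidate, log_pc
        chain.append(x)  # a rejected move repeats the current state
    return chain


def log_target(x):
    """Assumed target: Gaussian with mean 3 and std 0.5, normalizing constant omitted."""
    return -0.5 * ((x - 3.0) / 0.5) ** 2


chain = metropolis(log_target, x0=0.0, n_samples=20000)
kept = chain[2000:]  # discard an assumed burn-in period
mean = sum(kept) / len(kept)
std = math.sqrt(sum((x - mean) ** 2 for x in kept) / len(kept))
print(mean, std)
```

Only ratios of the target density enter the acceptance test, which is why the evidence (normalizing factor) never has to be computed. After the burn-in, the sample mean and spread of the chain should approach those of the assumed target.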

Perhaps the best way to visualize the MC methods follows directly from the
example of Frenkel [11]. Suppose that a reasonable estimate of the depth of the Mississippi
river is required. Using numerical quadrature techniques the integrand value is
measured at prespecified grid points. We also note that the grid points may not be in
regions of interest and, in fact, the integrand may vanish as shown in Fig. 1.2. On the
other hand, the surveyor is in the Mississippi and performing a (random) walk within
the river measuring the depth of the river directly. In this sampling approach
measurements are accepted as long as the surveyor is in the river and rejected if outside.
Here the “average” depth is simply the sample average of the measurements much 
the same as a sampling technique might perform. So we see that a refinement of the 
brute force integration approach is to use random points or samples that “most likely” 
come from regions of high contribution to the integral rather than from low regions.

FIGURE 1.2 Monte Carlo sampling compared with numerical grid based integration for
depth of Mississippi estimation.


Modern MC techniques such as in numerical integration seek to select
random samples in high regions of concentration of the integrand by drawing samples
from a proposed function very similar to the integrand. These methods lead to the
well-known importance sampling approaches (see Chapter 3). Besides numerical
integration problems that are very important in statistical signal processing for extracting
signals/parameters of interest, numerical optimization techniques (e.g., genetic
algorithms, simulated annealing, etc.) benefit directly from sampling technology. This
important discovery has evolved ever since and become even more important with 
the recent development of high speed/high throughput computers. 

Consider the following simple example of estimating the area of a circle to illustrate 
the MC approach. 


Example 1.1 

Define a sample space bounded by a square circumscribing (same center) a circle 
of radius r. Draw uniform random samples, say z := (X, Y), such that each coordinate 
is distributed as U(-r, +r); therefore, the ratio of the number of random samples 
drawn from within the circle of radius r to the total number of samples drawn 
(bounded by the square) defines the probability 

$$\Pr(Z = z) = \frac{\text{No. of circle samples}}{\text{Total No. of (square) samples}}$$

From geometry we know that the probability is simply the ratio of the two areas 
(circle-to-square), that is, 

$$\Pr(Z = z) = \frac{\pi r^2}{4 r^2} = \frac{\pi}{4}$$

Let r = 1, then a simple computer code can be written that: 

• Draws the X,Y-coordinates from z ~ U(-1, +1); 

• Calculates the range function, $\rho = \sqrt{X^2 + Y^2}$; 

• Counts the number of samples whose range satisfies $\rho \le r$; 

• Estimates the probability, Pr(Z = z). 

The area is determined by multiplying the estimated probability by the area of the 
square. The resulting sample scatter plot is shown in Fig. 1.3 for a 10,000 sample 
realization resulting in π ≈ 3.130. As the number of samples increases, the estimate 
of the area (π) gets better and better, demonstrating the MC approach. △△△ 
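
The steps of Example 1.1 can be sketched in a few lines of code. The following is a minimal Python illustration, not the code used to produce Fig. 1.3; the function name `estimate_circle_area`, the fixed seed and the sample size are assumptions chosen here for reproducibility.

```python
import math
import random


def estimate_circle_area(n_samples, r=1.0, seed=0):
    """Monte Carlo estimate of the area of a circle of radius r (illustrative sketch)."""
    rng = random.Random(seed)           # seeded generator (assumed) so runs repeat
    inside = 0
    for _ in range(n_samples):
        x = rng.uniform(-r, r)          # X-coordinate drawn from U(-r, +r)
        y = rng.uniform(-r, r)          # Y-coordinate drawn from U(-r, +r)
        rho = math.hypot(x, y)          # range function rho = sqrt(X^2 + Y^2)
        if rho <= r:                    # sample falls within the circle
            inside += 1
    prob = inside / n_samples           # estimated probability Pr(Z = z)
    return prob * (2.0 * r) ** 2        # probability times the area of the square


area = estimate_circle_area(10_000)
print(f"estimated area = {area:.3f}, true area = {math.pi:.3f}")
```

A different seed gives a slightly different estimate; the spread shrinks roughly as the inverse square root of the number of samples.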

In signal processing, we are usually interested in some statistical measure of a 
random signal or parameter usually expressed in terms of its moments [16-23]. For 
example, suppose we have some signal function, say f(X), with respect to some 
underlying probabilistic distribution, Pr(X). Then a typical measure to seek is its 
performance “on the average” which is characterized by the expectation 

$$E_X\{f(X)\} = \int f(X) \Pr(X)\, dX \qquad (1.3)$$



FIGURE 1.3 Area of a circle of unit radius using a Monte Carlo approach (area 
estimated as 3.130 using 10,000 samples). 

Instead of attempting to use direct numerical integration techniques, stochastic 
sampling techniques, or Monte Carlo integration, offer an alternative. As mentioned, 
the key idea embedded in the MC approach is to represent the required distribution as 
a set of random samples rather than a specific analytic function (e.g., Gaussian). As the 
number of samples becomes large, they provide an equivalent (empirical) 
representation of the distribution enabling moments to be estimated directly 
(inference). 

Monte Carlo integration draws samples from the required distribution and then 
forms sample averages to approximate the sought-after expectations. That is, MC 
integration evaluates integrals by drawing samples, {X(i)}, from the designated 
distribution Pr(X). Assuming perfect sampling, this produces the estimated or empirical 
distribution given by 

$$\hat{\Pr}(X) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(X - X(i))$$

which is a probability mass distribution with weights, 1/N, and random variable or 
sample, X(i). Substituting the empirical distribution into the integral gives 

$$E_X\{f(X)\} = \int f(X)\, \hat{\Pr}(X)\, dX \approx \frac{1}{N} \sum_{i=1}^{N} f(X(i)) \equiv \bar{f} \qquad (1.4)$$

which follows directly from the sifting property of the delta or impulse function. Here 
$\bar{f}$ is said to be a Monte Carlo estimate of $E_X\{f(X)\}$. 
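
As a hedged illustration of Eq. 1.4, the sketch below forms the sample average of f(X) over draws from Pr(X). The choices f(X) = X² and a zero-mean, unit-variance Gaussian for Pr(X) are assumptions made only so that the exact answer (the variance, 1) is known; the helper name `mc_expectation` is hypothetical.

```python
import random


def mc_expectation(f, sampler, n_samples):
    """Monte Carlo estimate of E{f(X)}: average f over samples X(i) drawn by sampler()."""
    total = 0.0
    for _ in range(n_samples):
        total += f(sampler())           # accumulate f(X(i))
    return total / n_samples            # (1/N) * sum of f(X(i))


rng = random.Random(1)                  # seeded generator (assumed) for repeatability
# Assumed test case: X ~ N(0, 1) and f(X) = X^2, so E{f(X)} = 1 exactly.
f_bar = mc_expectation(lambda x: x * x, lambda: rng.gauss(0.0, 1.0), 100_000)
print(f"MC estimate = {f_bar:.4f} (exact value 1)")
```

Note that no integral is evaluated on a grid; the only requirement is the ability to draw samples from the distribution.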

As stated previously, scientists (Ulam, von Neumann, Metropolis, Fermi, Teller, 
etc. [7]) created statistical sampling-based or equivalently simulation-based methods 
for solving problems efficiently (e.g., neutron diffusion or eigenvalues of the 
Schrödinger relation). The MC approach to problem solving is a class of stochastic 
computations to simulate the dynamics of physical or mathematical systems capturing 
their inherent uncertainties. The MC method is a powerful means for generating 
random samples used in estimating conditional and marginal probability distributions 
required for statistical estimation and therefore signal processing. It offers an 
alternative numerical approach to find solutions to mathematical problems that cannot 
easily be solved by integral calculus or other numerical methods. As mentioned, the 
efficiency of the MC method increases (relative to other approaches) as the problem 
dimensionality increases. It is useful for investigating systems with a large number of 
degrees of freedom (e.g., energy transport, materials, cells, genetics) especially for 
systems with input uncertainty [5]. 

These concepts have recently evolved to the signal processing area and are of high 
interest in nonlinear estimation problems especially in model-based signal processing 
applications [16] as discussed next. 

1.4 BAYESIAN MODEL-BASED SIGNAL PROCESSING 

The estimation of probability distributions required to implement Bayesian 
processors is at the heart of this approach. How are these distributions obtained from 
data or simulations? Nonparametric methods of distribution estimation ranging from 
simple histogram estimators to sophisticated kernel smoothing techniques rooted in 
classification theory [3] offer reasonable approaches when data are available. However, 
these approaches usually do not take advantage of prior knowledge about the 
underlying physical phenomenology generating the data. An alternative is to parameterize 
the required distributions by prior knowledge of their actual form (e.g., exponential, 
Poisson, etc.) and fit their parameters from data using optimization techniques [3]. 
Perhaps the ideal realization is the parameterization of the evolution dynamics 
associated with the physical phenomenology using the underlying mathematical 
representation of the process combined with the data samples. This idea provides the 
essence of the model-based approach to signal processing which (as we shall see), when 
combined with Bayesian processors, provides a formidable tool to attack a wide variety 
of complex processing problems in a unified manner. An alternative view of the 
underlying processing problem is to decompose it into a set of steps that capture the 
strategic essence of the processing scheme. Inherently, we believe that the more a priori 
knowledge about the measurement and its underlying phenomenology we can incorporate 
into the processor, the better we can expect the processor to perform, as long as the 
information that is included is correct! One strategy called the model-based approach 
provides the essence of model-based signal processing [1]. 

Simply stated, the model-based approach is “incorporating mathematical models 
of both physical phenomenology and the measurement process (including noise) into 
the processor to extract the desired information.” This approach provides a mechanism 
to incorporate knowledge of the underlying physics or dynamics in the form of 
mathematical process models along with measurement system models and accompanying 
noise as well as model uncertainties directly into the resulting processor. In this way 
the model-based processor (MBP) enables the interpretation of results directly in terms 
of the problem physics. It is actually a modeler’s tool enabling the incorporation of any 
a priori information about the problem to extract the desired information. The fidelity 
of the model incorporated into the processor determines the complexity of the 
model-based processor with the ultimate goal of increasing the inherent signal-to-noise 
ratio (SNR). These models can range from simple, implicit, non-physical representations 
of the measurement data such as the Fourier or wavelet transforms to parametric 
black-box models used for data prediction, to lumped mathematical representations 
characterized by ordinary differential equations, to distributed representations 
characterized by partial differential equation models to capture the underlying physics 
of the process under investigation. The dominating factor of which model is the most 
appropriate is usually determined by how severely the measurements are contaminated 
with noise and the underlying uncertainties. If the SNR of the measurements is high, 
then simple non-physical techniques can be used to extract the desired information; 
however, for low SNR measurements more and more of the physics and instrumentation 
must be incorporated for the extraction. For instance, consider the example of 
detecting the presence of a particular species in a test solution using a microcantilever 
sensor measurement system [4]. 

FIGURE 1.4 Model-based approach to signal processing: process (chemistry 
and physics), measurement (microcantilever sensor array) and noise (Gaussian) 
representations (panel title: Signal extraction). 


Example 1.2 

The model-based processing problem is characterized in Fig. 1.4 representing the 
process of estimating the presence of a particular species of material in solution using 
the multichannel microcantilever sensor system. Here the microcantilever sensor is 
pre-conditioned by depositing attractor material on its levers to attract molecules 
of the target species. Once calibrated, the test solution flows along the levers with 
the target molecules attracted and deposited on each “tuned” microcantilever creating 
a deflection that is proportional to the concentration. This deflection is measured using 
a laser interferometric technique and digitized for processing. The process model is 
derived directly from the fluidics, while the measurement model evolves from the 
dynamics of the microcantilever structure. The resulting processor is depicted in 
Fig. 1.5, where we note the mathematical models of both the process dynamics and 
microcantilever measurement system. Since the parameters, θ, of the model are unknown 
a priori, calibration data are used to estimate them directly and then they are employed 
in the MBP to provide the enhanced signal estimate shown in the figure. Even though 
the problem is nonlinear and non-Gaussian, the processor appears to yield reasonable 
estimates. See Sec. 10.3 [4] for details. △△△ 

The above example demonstrates that incorporating reasonable mathematical 
models of the underlying phenomenology can lead to improved processing capability; 
however, even further advantages can be realized by combining the MBP concepts 
with Bayesian constructs to generalize solutions. 

FIGURE 1.5 Model-based processor representation of species detection problem: process 
(concentration model), measurement (microcantilever sensor array), raw data, parameter 
estimator (coefficients) and model-based processor (enhancement). 

Combining Bayesian and model-based signal processing can be considered a 
parametric representation of the required distributions using mathematical models of the 
underlying physical phenomenology and measurement (sensor) system. Certainly, if 
we assume the distribution is Gaussian and we further constrain the processes to be 
Markovian (only depending on the previous sample), then the multivariate Gaussian 
can be completely characterized using state-space models resulting in the well-known 
Kalman filter in the linear model case [2]. 

Since we are primarily concerned with pseudo real-time techniques in this text, we 
introduce the notion of a recursive form leading to the idea of sequential processing 
techniques. That is, we investigate “recursive” or equivalently “sequential” solutions 
to the estimation problem. Recursive estimation techniques evolved quite naturally 
during the advent of the digital computer in the late fifties, since both are sequential 
processes. It is important to realize the recursive solution is identical to the batch 
solution after it converges, so there is no gain in estimator performance properties; 
however, the number of computations is significantly less than the equivalent batch 
technique. It is also important to realize that the recursive approach provides the 
underlying theoretical and pragmatic basis of all adaptive estimation techniques; 
thus, they are important in their own right [2]! 

Many processors can be placed in a recursive form with various subtleties emerging 
in the calculation of the current estimate ($\hat{X}_{old}$). The standard technique employed 
is based on correcting or updating the current estimate as a new measurement data 
sample becomes available. The estimates generally take the recursive form: 

$$\hat{X}_{new} = \hat{X}_{old} + K E_{new} \qquad (1.5)$$

where 

$$E_{new} = Y - \hat{Y}_{old} = Y - C \hat{X}_{old}$$

Here we see that the new estimate is obtained by correcting the old estimate with 
a K-weighted error. The error term $E_{new}$ is the new information or innovation, that is, 
the difference between the actual and the predicted measurement ($\hat{Y}_{old}$) based on the 
old estimate ($\hat{X}_{old}$). The computation of the weight matrix K depends on the criterion 
used (e.g., mean-squared error, absolute error, etc.). 

Consider the following example, which shows how to recursively estimate the 
sample mean. 

Example 1.3 

The sample mean estimator can easily be put in recursive form. The estimator is 
given by 

$$\hat{X}(N) = \frac{1}{N} \sum_{t=1}^{N} y(t)$$

Extracting the $N^{th}$ term from the sum, we obtain 

$$\hat{X}(N) = \frac{1}{N} y(N) + \frac{1}{N} \sum_{t=1}^{N-1} y(t)$$

Identify $\hat{X}(N-1)$ from the last term, 

$$\hat{X}(N) = \frac{1}{N} y(N) + \frac{N-1}{N} \hat{X}(N-1)$$

The recursive form is given by 

$$\underbrace{\hat{X}(N)}_{NEW} = \underbrace{\hat{X}(N-1)}_{OLD} + \frac{1}{N} \underbrace{[y(N) - \hat{X}(N-1)]}_{ERROR}$$

This procedure to develop the “recursive form” is very important and can be applied 
to a multitude of processors. Note the steps in determining the form: 

1. Remove the $N^{th}$-term from the summation; 

2. Identify the previous estimate in terms of the N - 1 remaining terms; and 

3. Perform the algebra to determine the gain factor and place the estimator in the 
recursive form of Eq. 1.5 for a scalar measurement. △△△ 
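
The recursion of Example 1.3 can be sketched as follows. This is a minimal Python illustration under the assumption of a scalar measurement sequence; the function name `recursive_mean` and the test data are hypothetical. Each pass applies NEW = OLD + gain × ERROR with gain 1/N.

```python
def recursive_mean(measurements):
    """Recursive sample mean: X(N) = X(N-1) + (1/N) * [y(N) - X(N-1)]."""
    x_hat = 0.0                          # initial estimate; overwritten at N = 1 (gain = 1)
    history = []
    for n, y in enumerate(measurements, start=1):
        gain = 1.0 / n                   # weight K = 1/N
        error = y - x_hat                # innovation: new measurement minus old estimate
        x_hat = x_hat + gain * error     # NEW = OLD + K * ERROR
        history.append(x_hat)
    return history


data = [2.0, 4.0, 6.0, 8.0]              # assumed measurement sequence
estimates = recursive_mean(data)
batch = sum(data) / len(data)            # batch sample mean for comparison
print(estimates[-1], batch)
```

After the last measurement the recursive estimate agrees with the batch sample mean, while only the previous estimate and the sample count need to be stored.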


1.5 NOTATION AND TERMINOLOGY 

The notation used throughout this text is standard in the literature. Where necessary, 
vectors are represented by boldface, lowercase, x, and matrices by boldface, 
uppercase, A. We denote the real part of a signal by Re x and its imaginary part by Im x. 
We define the notation $\underline{N}$ to be a shorthand way of writing 1, 2, ..., N. It will be 
used in matrices, $A(\underline{N})$, to mean there are N columns of A. As mentioned previously, 
estimators are annotated by the caret, such as $\hat{x}$. We also define partial derivatives at 
the component level by $\partial/\partial\theta_i$, the $N_\theta$-gradient vector by $\nabla_\theta$ and higher order 
partials by $\nabla_\theta^2$. 

The most difficult notational problem will be with the “time” indices. Since this 
text is predominantly discrete-time, we will use the usual time symbol, t, to mean 
a discrete-time index, that is, $t \in \mathcal{I}$ for $\mathcal{I}$ the set of integers. However, and hopefully 
not too confusing, t will also be used for continuous-time, that is, $t \in \mathcal{R}$ for $\mathcal{R}$ the set 
of real numbers denoting the continuum. When used as a continuous-time variable, 
$t \in \mathcal{R}$, it will be represented as a subscript to distinguish it, that is, $x_t$. This approach of 
choosing $t \in \mathcal{I}$ primarily follows the system identification literature and eases the 
recognition of discrete-time variables in transform relations (e.g., discrete Fourier 
transform). The rule-of-thumb is therefore to “interpret t as a discrete-time index 
unless noted by a subscript as continuous in the text.” With this in mind we will 
define a variety of discrete estimator notations such as $\hat{x}(t|t-1)$ to mean the estimate at 
time (discrete) t based upon all of the previous data up to t - 1. We will define these 
symbols prior to their use within the text to assure no misunderstanding of their meaning. 

With a slight abuse of notation, we will use the terminology distribution of X, 
Pr(X) in general, so as not to have to differentiate between density for continuous 
random variables or processes and mass for discrete variates. It will be obvious from 
the context which is meant. In some cases, we will be required to make the distinction 
between cumulative distribution function (CDF) and density (PDF) or mass (PMF) 
functions. Here we use the uppercase notation, $P_X(x)$, for the CDF and lower case 
$p_X(x)$ for the PDF or PMF. 

Subsequently we will also need to express a discrete PMF as a continuous PDF 
using impulse or delta functions as “samplers” much the same as in signal processing 
when we assume there exists an impulse sampler that leads to the well-known 
Nyquist sampling theorem [2]. Thus, corresponding to a discrete PMF we can define 
a continuous PDF through the concept of an impulse sampler, that is, given a discrete 
PMF defined by 

$$p_X(x) \approx p(X = x_i) = \sum_i p_i\, \delta(x - x_i) \qquad (1.6)$$

then we define the equivalent continuous PDF as $p_X(x)$. Moments follow from the 
usual definitions associated with a continuous PDF, for instance, consider the 
definition of the expectation or mean. Substituting the equivalent PDF and utilizing the 
sifting property of the impulse function gives 

$$E\{x\} = \int x\, p_X(x)\, dx = \int x \left( \sum_i p_i\, \delta(x - x_i) \right) dx = \sum_i x_i p_i \qquad (1.7)$$

which is precisely the mean of the discrete PMF. 
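
As a brief worked illustration of this sifting argument (the fair six-sided die is an assumption introduced here, not an example from the text), take $x_i = i$ and $p_i = 1/6$ for $i = 1, \ldots, 6$, so that

$$E\{x\} = \int x \left( \sum_{i=1}^{6} \frac{1}{6}\, \delta(x - i) \right) dx = \sum_{i=1}^{6} \frac{i}{6} = \frac{21}{6} = 3.5$$

which agrees with the mean computed directly from the discrete PMF.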

Also, as mentioned, we will use the symbol ~ to mean “distributed according to” 
as in $x \sim \mathcal{N}(m, v)$ defining the random variable x as Gaussian distributed with mean 
m and variance v. We may also use the extended notation: $\mathcal{N}(x : m, v)$ to include 
the random variable x as well. When sampling we use the non-conventional right 
arrow “action” notation → to mean “draw a sample from” a particular distribution 
such as $x_i \rightarrow \Pr(x)$; this again will be clear from the context. When resampling, that 
is, replacing samples with new ones, we use the “block” right arrow such as $x_j \Rightarrow x_i$ 
meaning new sample $x_j$ replaces current sample $x_i$. 

Finally in a discrete (finite) probabilistic representation, we define a purely discrete 
variate as $x_k(t) := \Pr(x(t) = \mathcal{X}_k)$ meaning that x can only take on values (integers) k 
from a known set $\mathcal{X} = \{\mathcal{X}_1, \ldots, \mathcal{X}_k, \ldots, \mathcal{X}_N\}$ at time t. We also use the symbol △△△ 
to mark the end of an example. 

MATLAB NOTES 

MATLAB is a command-oriented vector-matrix package with a simple yet effective 
command language featuring a wide variety of embedded C language constructs 
making it ideal for signal processing applications and graphics. All of 
the algorithms we have applied to the examples and problems in this text are 
MATLAB-based in solution, ranging from simple simulations to complex 
applications. We will develop these notes primarily as a summary to point out to the 
reader many of the existing commands that already perform the signal processing 
operations discussed in the presented chapter and throughout the text. 


REFERENCES 

1. J. Candy, Signal Processing: The Model-Based Approach (New York: McGraw-Hill, 
1986). 

2. J. Candy, Model-Based Signal Processing (Hoboken, NJ: Wiley/IEEE Press, 2006). 

3. R. Duda, P. Hart and D. Stork, Pattern Classification (Hoboken, NJ: Wiley/IEEE Press, 
2001). 

4. J. Tringe, D. Clague, J. Candy and C. Lee, “Model-based signal processing of 
multichannel cantilever arrays,” IEEE J. Micromech. Syst., 15, 5, 1371-1391, 2006. 

5. S. Ulam, R. Richtmyer and J. von Neumann, “Statistical methods in neutron diffusion,” 
Los Alamos Scientific Laboratory Report, LAMS-551, 1947. 

6. N. Metropolis and S. Ulam, “The Monte Carlo method,” J. American Stat. Assoc., 44, 
335-341, 1949. 

7. N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller and E. Teller, “Equations of state 
calculations by fast computing,” J. Chemical Physics, 21, 6, 1087-1091, 1953. 

8. W. Hastings, “Monte Carlo sampling methods using Markov chains and their applica¬ 
tions,” Biometrika, 57, 1, 97-109, 1970. 

9. N. Metropolis, “The beginning of the Monte Carlo method,” Los Alamos Science, Special 
Issue, 125-130, 1987. 

10. J. Liu, Monte Carlo Strategies in Scientific Computing (New York: Springer-Verlag, 2001). 

11. D. Frenkel, “Introduction to Monte Carlo methods,” in Computational Soft Matter: From 
Synthetic Polymers to Proteins, N. Attig, K. Binder, H. Grubmüller and K. Kremer (Eds.) 
J. von Neumann Instit. for Computing, Jülich, NIC Series, Vol. 23, pp. 29-60, 2004. 

12. C. Robert and G. Casella, Monte Carlo Statistical Methods (New York: Springer, 1999). 

13. M. Tanner, Tools for Statistical Inference: Methods for the Exploration of Posterior 
Distributions and Likelihood Functions, 2nd Ed. (New York: Springer-Verlag, 1993). 

14. J. Ruanaidh and W. Fitzgerald, Numerical Bayesian Methods Applied to Signal Processing 
(New York: Springer-Verlag, 1996). 

15. W. Gilks, S. Richardson and D. Spiegelhalter, Markov Chain Monte Carlo in Practice 
(New York: Chapman & Hall/CRC Press, 1996). 

16. A. Doucet, N. de Freitas and N. Gordon, Sequential Monte Carlo Methods in Practice 
(New York: Springer-Verlag, 2001). 




17. B. Ristic, S. Arulampalam and N. Gordon, Beyond the Kalman Filter: Particle Filters for 
Tracking Applications (Boston: Artech House, 2004). 

18. O. Cappé, E. Moulines and T. Rydén, Inference in Hidden Markov Models (New York: 
Springer-Verlag, 2005). 

19. S. Godsill and P. Djuric, “Special Issue: Monte Carlo methods for statistical signal 
processing,” IEEE Trans. Signal Proc., 50, 173-499, 2002. 

20. P. Djuric, J. Kotecha, J. Zhang, Y. Huang, T. Ghirmai, M. Bugallo and J. Miguez, “Particle 
filtering,” IEEE Signal Proc. Mag., 20, 5, 19-38, 2003. 

21. S. Haykin and N. de Freitas, “Special Issue: Sequential state estimation: from Kalman 
filters to particle filters,” Proc. IEEE, 92, 3, 399-574, 2004. 

22. A. Doucet and X. Wang, “Monte Carlo methods for signal processing,” IEEE Signal Proc. 
Mag., 24, 5, 152-170, 2005. 

23. J. Candy, “Bootstrap particle filtering for passive synthetic aperture in an uncertain ocean 
environment.” IEEE Signal Proc. Mag., 24, 4, 73-85, 2007. 


PROBLEMS 

1.1 Estimate the number of times a needle when dropped between two parallel 
lines intersects a line. One way to accomplish this is experimentally by setting 
up the experiment and doing it; this is the famous Buffon’s needle experiment 
performed in 1725. 

(a) Set up the experiment and perform the measurements for 100 samples. 
Estimate the underlying probabilities. 

(b) Analyze the experiment using a “closed form” approach. 

(c) How do your answers compare? 

Note that this is one of the first Monte Carlo approaches to problem solving. 

1.2 Suppose we have three loaded dice with the following six “face” probabilities 
(each): 

D1 = {1/12, 1/6, 1/12, 1/3, 1/6, 1/6} 
D2 = {1/6, 1/6, 1/6, 1/12, 1/12, 1/3} 
D3 = {1/6, 1/6, 1/6, 1/12, 1/12, 1/3} 

Applying Bayes’ rule, answer the following questions: 

(a) Selecting a die at random from the three, what is the probability of rolling 
a 6? 

(b) What is the probability that die two (D = D2) was selected, if a six (R = 6) 
is rolled with the chosen die? 

1.3 A binary communication transmitter (T) sends either a 0 or a 1 through a 
channel to a receiver (R) with the following probabilities for each as: 

$\Pr(T_1) = 0.6$          $\Pr(T_0) = 0.4$ 
$\Pr(R_1|T_1) = 0.9$      $\Pr(R_0|T_1) = 0.1$ 
$\Pr(R_1|T_0) = 0.1$      $\Pr(R_0|T_0) = 0.9$ 

(a) What is the probability that $R_1$ is received? 

(b) What is the probability that $R_0$ is received? 

(c) What is the probability that the true transmitted signal was a 1, when a 1 
was received? 

(d) What is the probability that the true transmitted signal was a 0, when a 0 
was received? 

(e) What is the probability that the true transmitted signal was a 1, when a 0 
was received? 


(f) What is the probability that the true transmitted signal was a 0, when a 1 
was received? 


(g) Draw a probabilistic directed graph with nodes being the transmitters 
and receivers and links being the corresponding prior and conditional 
probabilities. 


1.4 We are asked to estimate the displacement of large vehicles (semi-trailers) 
when parked on the shoulder of a freeway and subjected to wind gusts created 
by passing vehicles. We measure the displacement of the vehicle by placing 
an accelerometer on the trailer. The accelerometer has inherent inaccuracies 
which are modeled as 

$$y = K_a x + n$$

with y, x, n the measured and actual displacement and white measurement noise 
of variance $R_{nn}$ and $K_a$ the instrument gain. The dynamics of the vehicle can 
be modeled by a simple mass-spring-damper. 

(a) Construct and identify the measurement model of this system. 

(b) Construct and identify the process model and model-based estimator for 
this problem. 

1.5 Think of measuring the temperature of a liquid in a beaker heated by a burner. 
Suppose we use a thermometer immersed in the liquid and periodically observe 
the temperature and record it. 

(a) Construct a measurement model assuming that the thermometer is linearly 
related to the temperature, that is, $y(t) = k \Delta T(t)$. Also model the 
uncertainty of the visual measurement as a random sequence v(t) with 
variance $R_{vv}$. 

(b) Suppose we model the heat transferred to the liquid from the burner as 

$$Q(t) = C A \Delta T(t)$$

where C is the coefficient of thermal conductivity, A is the cross-sectional area, 
and $\Delta T(t)$ is the temperature gradient with assumed random uncertainty w(t) 
and variance $R_{ww}$. Using this process model and the models developed above, 
identify the model-based processor representation. 

1.6 We are given an RLC series circuit driven by a noisy voltage source $V_{in}(t)$ and 
we use a measurement instrument that linearly amplifies by K and measures the 
corresponding output voltage. We know that the input voltage is contaminated 
by an additive noise source, w(t), with covariance $R_{ww}$ and the measured 
output voltage is similarly contaminated with noise source, v(t), with $R_{vv}$. 

(a) Determine the model for the measured output voltage, $V_{out}(t)$ 
(measurement model). 

(b) Determine a model for the circuit (process model). 

(c) Identify the general model-based processor structures. In each scheme, 
specify the models for the process, measurement and noise. 

1.7 A communications satellite is placed into orbit and must be maneuvered using 
thrusters to orientate its antennas. Restricting the problem to the single axis 
perpendicular to the page, the equations of motion are 

$$J \frac{d^2\theta}{dt^2} = T_c + T_d$$

where J is the moment of inertia of the satellite about its center of mass, 
$T_c$ is the thruster control torque, $T_d$ is the disturbance torque, and $\theta$ is the 
angle of the satellite axis with respect to the inertial reference (no angular 
acceleration). Develop signal and noise models for this problem and identify 
each model-based processor component. 

1.8 Consider a process described by a set of linear differential equations 

$$\frac{d^2 c}{dt^2} + \frac{dc}{dt} + c = Km$$

The process is to be controlled by a proportional-integral-derivative (PID) 
control law governed by the equation 

$$m = K_p \left( e + \frac{1}{T_i} \int e\, dt + T_d \frac{de}{dt} \right)$$

and the controller reference signal r is given by 


Suppose the reference is subjected to a disturbance signal and the measurement 
sensor, which is contaminated with additive noise, measures the “square” of 
the output. Develop the model-based signal and noise models for this problem. 


1.9 The elevation of a tracking telescope is controlled by a DC motor. It has a 
moment of inertia J and damping B due to friction; the equation of motion is 
given by 

$$J \frac{d^2\theta}{dt^2} + B \frac{d\theta}{dt} = T_m + T_d$$

where $T_m$ and $T_d$ are the motor and disturbance torques and $\theta$ is the elevation 
angle. Assume a sensor transforms the telescope elevation into a proportional 
voltage that is contaminated with noise. Develop the signal and noise models 
for the telescope and identify all of the model-based processor components. 

1.10 Suppose we have a two-measurement system given by 


y = 


1 ^ 


where $R_{vv}$ = diag[1, 0.1]. 

(a) What is the batch least-squares estimate (W = I) of the parameter x, if 
y = [7 21]'? 

(b) What is the batch weighted least-squares estimate of the parameter x with 
W selected for minimum variance estimation? 

1.11 Calculate the batch and sequential least-squares estimate of the parameter 
vector x based on two measurements y(1) and y(2) where 


y(1) = C(1)x + v(1) = 
y(2) = c'x + v(2) = 4 

c = [0 1]', C'(1) = [1 2], W = 1 

BAYESIAN ESTIMATION 


2.1 INTRODUCTION 

In this chapter we motivate the idea of Bayesian estimation from a probabilistic 
perspective, that is, we perform the required estimation using the underlying densities 
or mass functions. We start with the “batch” approach and evolve to the Bayesian 
sequential techniques. We discuss the most popular formulations: maximum a posteriori 
(MAP), maximum likelihood (ML), minimum variance (MV) or equivalently minimum 
mean-squared error (MMSE) and least-squares (LS) methods. Bayesian sequential 
techniques are then developed. The main idea is to develop the proper perspective for 
the subsequent chapters and construct a solid foundation for the techniques to follow. 

2.2 BATCH BAYESIAN ESTIMATION 

Suppose we are trying to estimate a random parameter X from data Y = y. Then the 
associated conditional density Pr(X|Y = y) is called the posterior density because 
the estimate is conditioned “after (post) the measurements” have been acquired. 
Estimators based on this a posteriori density are usually called Bayesian because 
they are constructed from Bayes’ theorem, since Pr(X|Y) is difficult to obtain directly. 
That is, Bayes’ rule is defined 

$$\Pr(X|Y) := \Pr(Y|X)\, \frac{\Pr(X)}{\Pr(Y)} \qquad (2.1)$$

where Pr(X) is called the prior density (before measurement), Pr(Y|X) is called the 
likelihood (more likely to be true) and Pr(Y) is called the evidence (normalizes the 
posterior to assure its integral is unity). Bayesian methods view the sought-after 
parameter as random possessing a “known” a priori density. As measurements are 
made, the prior is converted to the posterior density function adjusting the parameter 
estimates. Thus, the result of increasing the number of measurements is to improve 
the a posteriori density resulting in a sharper peak closer to the true parameter as 
depicted in Fig. 1.1. 

Bayesian Signal Processing. By James V. Candy 

To solve the estimation problem, the first step requires the determination of the 
a posteriori density. A logical solution to this problem leads us to find the “most 
probable” value of Pr(X|Y), its maximum [1]. The maximum a posteriori (MAP) 
estimate is the value of X that maximizes the posterior density, that is, 

$$\max_X \Pr(X|Y) \qquad (2.2)$$

The optimization is carried out in the usual manner by differentiating, setting the 
result to zero and solving to obtain the MAP equation 

$$\nabla_X \Pr(X|Y)\big|_{X=\hat{X}_{MAP}} = 0 \qquad (2.3)$$

with the gradient vector $\nabla_X \in R^{N_X \times 1}$ defined by 

$$\nabla_X := \left[ \frac{\partial}{\partial X_1} \cdots \frac{\partial}{\partial X_{N_X}} \right]' \qquad (2.4)$$

Because many problems are based on the exponential class of densities, 
the ln Pr(X|Y) is considered instead. Since the logarithm is a monotonic function, 
the maximum of Pr(X|Y) and ln Pr(X|Y) occur at the same value of X. Therefore, the 
logarithmic MAP equation is 

$$\nabla_X \ln \Pr(X|Y)\big|_{X=\hat{X}_{MAP}} = 0 \qquad (2.5)$$

Now if we apply Bayes’ rule to Eq. 2.5, then 

$$\ln \Pr(X|Y) = \ln \Pr(Y|X) + \ln \Pr(X) - \ln \Pr(Y) \qquad (2.6)$$

Since Pr(Y) is not a function of the parameter X, the MAP equation can be written 
succinctly as 

$$\nabla_X \ln \Pr(X|Y)\big|_{X=\hat{X}_{MAP}} = \nabla_X \left( \ln \Pr(Y|X) + \ln \Pr(X) \right)\big|_{X=\hat{X}_{MAP}} = 0 \qquad (2.7)$$

With a variety of estimators available, we must construct some way of ascertaining
performance. The quality of an estimator is usually measured in terms of its estimation
error,

$$\tilde{X} = X - \hat{X} \tag{2.8}$$
A common measure of estimator quality is called the Cramér-Rao lower bound
(CRLB). The CRLB offers a means of assessing estimator quality prior to processing
the measured data. We restrict discussion of the CRLB to the case of unbiased
estimates, $\hat{X}$, of a “non-random” parameter X. The bound is easily extended to more
complex cases for biased estimates as well as random parameters [2, 3]. The Cramér-Rao
lower bound¹ for any unbiased estimate $\hat{X}$ of X based on the measurement, Y, is
given by

$$R_{\tilde{X}|X} = \mathrm{cov}(X - \hat{X}(Y)\,|\,X = x) \geq \mathcal{I}^{-1} \tag{2.9}$$

where $\mathcal{I}$ is the $N_X \times N_X$ information matrix defined by

$$\mathcal{I} := -E_Y\{\nabla_X (\nabla_X \ln \Pr(Y|X))'\} \tag{2.10}$$

with the gradient vector defined above. Any estimator satisfying the CRLB with
equality is called efficient. The bound is easily calculated using the chain rule from
vector calculus [5] defined by

$$\nabla_X (a'b) := (\nabla_X a')b + (\nabla_X b')a, \qquad a, b \in R^{N_X \times 1} \tag{2.11}$$

where a, b are functions of X. Consider the following example illustrating the
calculation of the CRLB.

Example 2.1 

Suppose we would like to estimate a nonrandom but unknown parameter, X, from a
measurement y contaminated by additive Gaussian noise, that is,

$$y = X + v$$

where $v \sim \mathcal{N}(0, R_{vv})$ and X is unknown. Thus, we have that

$$E\{Y|X\} = E\{X + v|X\} = X$$

and

$$\mathrm{var}(Y|X) = E\{(y - E\{Y|X\})^2|X\} = E\{v^2|X\} = R_{vv}$$

which gives

$$\Pr(Y|X) \sim \mathcal{N}(X, R_{vv})$$

and therefore

$$\ln \Pr(Y|X) = -\frac{1}{2}\ln(2\pi R_{vv}) - \frac{1}{2}\frac{(y - X)^2}{R_{vv}}$$

¹ We choose the matrix-vector version, since parameter estimators are typically vector estimates.
Differentiating according to the chain rule of Eq. 2.11 and taking the expectation we
obtain

$$\mathcal{I} = -E\left\{\frac{\partial^2}{\partial X^2}\ln \Pr(Y|X)\right\} = -E\left\{\frac{\partial}{\partial X}\frac{(y - X)}{R_{vv}}\right\} = \frac{1}{R_{vv}}$$

and therefore the CRLB is

$$R_{\tilde{X}|X} \geq R_{vv}$$

△△△

The utility of the CRLB is twofold: (1) it enables us to measure estimator quality
because it indicates the “best” (minimum error covariance) that any estimator can
achieve, and (2) it allows us to decide whether or not a designed estimator is
efficient, that is, any estimator achieving the CRLB with equality is efficient, a desirable
statistical property.
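A Monte Carlo check of the scalar Gaussian example can make the bound concrete. The sketch below is an illustration under stated assumptions, not part of the original text: it uses the estimator $\hat{X} = y$ for the model y = X + v, estimates its error variance by simulation, and estimates the information from the mean squared score (for this model the expected squared derivative of ln Pr(Y|X) equals the negative expected second derivative). The parameter values, seed and function name are hypothetical.

```python
import random


def crlb_check(true_x, noise_var, num_trials, seed=0):
    """Simulate y = X + v and compare the error variance of x_hat = y with 1/I."""
    rng = random.Random(seed)
    noise_std = noise_var ** 0.5
    sum_sq_error = 0.0
    sum_sq_score = 0.0
    for _ in range(num_trials):
        y = true_x + rng.gauss(0.0, noise_std)      # measurement model y = X + v
        x_hat = y                                   # unbiased estimator under test
        sum_sq_error += (true_x - x_hat) ** 2
        score = (y - true_x) / noise_var            # d/dX ln Pr(Y|X)
        sum_sq_score += score ** 2
    error_var = sum_sq_error / num_trials           # unbiased, so MSE equals variance
    information = sum_sq_score / num_trials         # sample estimate of I
    return error_var, information


error_var, information = crlb_check(true_x=3.0, noise_var=2.0, num_trials=20000)
print(error_var, 1.0 / information)
```

Both printed numbers should lie near the assumed noise variance, consistent with the estimator achieving the bound with equality, that is, being efficient for this model.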

In summary, the properties of an estimator can be calculated prior to estimation
(in some cases), and these properties can be used to answer the question “how well
does this estimator perform?” Next we consider the case when the parameter X is not
random, leading to the maximum likelihood estimator.



2.3 BATCH MAXIMUM LIKELIHOOD ESTIMATION 

In contrast to the Bayesian approach, the likelihood method views the parameter as 
deterministic but unknown. We include it in the Bayesian discussion because as we 
will show both estimators are in fact intimately linked. Maximum likelihood produces 
the “best” estimate as the value which maximizes the probability of the measurements 
given that the parameter value is “most likely” true. In the estimation problem the 
measurement data are given along with the underlying structure of the probability 
density function (as in the Bayesian case), but the parameters of the density are 
unknown and must be determined from the measurements; therefore, the maximum 
likelihood estimate can be considered heuristically as that value of the parameter that 
best “explains” the measured data giving the most likely estimation. 

More formally, let X be a vector of unknown parameters, $X \in R^{N_X \times 1}$, and the
corresponding set of N conditionally independent measurements, $Y(N) := \{y(1) \cdots y(N)\}$
for $y \in R^{N_y \times 1}$. The likelihood of X given the measurements is defined to be
proportional to the value of the probability density of the measurements given the parameters,
that is,

$$\mathcal{L}(Y(N); X) \propto \Pr(Y(N)|X) = \Pr(y(1) \cdots y(N)|X) = \prod_{i=1}^{N} \Pr(y(i)|X) \tag{2.12}$$


where $\mathcal{L}$ is the likelihood function and Pr(Y|X) is the joint probability density
function of the measurements given the unknown parameters. This expression indicates
the usefulness of the likelihood function in the sense that in many applications
measurements are available and are assumed drawn as a sample from a “known” or
assumed known probability density function with unknown parameters (e.g., Poisson
with unknown mean). Once we have the measurements (given) and the likelihood
function, then we would like to find the best estimate of the parameters. If we search
through parameter space over various values of X, say $X_i$, then we select the value of
X that most likely specifies the underlying probability function that the measurement
sample was drawn from, that is, suppose we have two estimates, $X_i$ and $X_j$, for which

$$\Pr(Y|X_i) > \Pr(Y|X_j) \tag{2.13}$$

Thus, it is “more likely” that the Y(N) were drawn for parameter value $X_i$ than $X_j$,
since, equivalently, $\mathcal{L}(Y; X_i) > \mathcal{L}(Y; X_j)$. Searching over all X and selecting that value
of X that is maximum (most probable) leads to the maximum likelihood estimate (ML)
given by

$$\hat{X}_{ML}(Y) = \arg\max_{X} \Pr(Y|X) \tag{2.14}$$

As noted previously, many problems are characterized by the class of exponential
densities for the Bayesian estimator making it more convenient to use the natural
logarithm function; therefore, we define the log-likelihood function as

$$\Lambda(Y(N)|X) := \ln \mathcal{L}(Y; X) = \ln \Pr(Y(N)|X) \tag{2.15}$$

Since the logarithm is monotonic, it preserves the maximum of the likelihood
providing the identical result,

$$\hat{X}_{ML}(Y) = \arg\max_{X} \ln \Pr(Y|X) \tag{2.16}$$
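The search over parameter space in Eqs. 2.14 to 2.16 can be sketched directly. The example below is only an illustration under assumptions not in the text: it takes the Poisson-with-unknown-mean case mentioned above, forms the log-likelihood as a sum over conditionally independent counts (the logarithm of the product in Eq. 2.12), and selects the maximizing rate from a hypothetical candidate grid. The count values and function names are made up; the sample mean is printed alongside because it is the known closed-form ML estimate of a Poisson mean.

```python
import math


def poisson_log_likelihood(rate, counts):
    """Sum of ln Pr(y(i)|X) for conditionally independent Poisson counts."""
    return sum(y * math.log(rate) - rate - math.lgamma(y + 1) for y in counts)


def ml_estimate_grid(counts, candidates):
    """Return the candidate rate that maximizes the log-likelihood."""
    return max(candidates, key=lambda rate: poisson_log_likelihood(rate, counts))


counts = [3, 7, 4, 6, 5, 2, 8, 5]                  # hypothetical measurements
candidates = [0.01 * k for k in range(1, 2001)]    # search grid over the rate
x_ml = ml_estimate_grid(counts, candidates)
sample_mean = sum(counts) / len(counts)
print(x_ml, sample_mean)
```

A grid search is used here only because it mirrors the "search through parameter space" description; in practice the log-likelihood equation is usually solved analytically or with a gradient-based optimizer.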

What makes the ML estimator popular is the fact that it enjoys some very desirable 
properties that we list without proof (see [2] for details). 

1. ML estimates are consistent. 

2. ML estimates are asymptotically efficient with $R_{\tilde{X}|X} = \mathcal{I}^{-1}$.

3. ML estimates are asymptotically Gaussian with $\mathcal{N}(X, R_{\tilde{X}\tilde{X}})$.

4. ML estimates are invariant, that is, if $\hat{X}_{ML}$ is the ML estimate of X, then any
function of the ML estimate is the ML estimate of the function, $\hat{f}_{ML} = f(\hat{X}_{ML})$.

5. ML estimates of the sufficient statistic are equivalent to the ML estimates over 
the original data. 

These properties are asymptotic and therefore imply that a large amount of data must 
be available for processing. 

Mathematically, the relationship between the MAP and ML estimators is clear
even though philosophically they are quite different in construct. If we take the MAP
equation of Eq. 2.6 and ignore the a priori distribution Pr(X) (assume X is unknown
but deterministic), then the maximum likelihood estimator is only a special case of
MAP. Using the same arguments as before, we use the ln Pr(X|Y) instead of Pr(X|Y).
We obtain the maximum likelihood estimate by solving the log-likelihood equation
and checking for the existence of a maximum; that is,

$$\nabla_X \ln \Pr(X|Y)\big|_{X=\hat{X}_{ML}} = 0 \tag{2.17}$$

Of course, to check for a maximum we have that $\nabla_X(\nabla_X \ln \Pr(X|Y))' < 0$. Again
applying Bayes’ rule as in Eq. 2.1 and ignoring Pr(X), we have

$$\nabla_X \ln \Pr(X|Y) = \nabla_X \ln \Pr(Y|X)\big|_{X=\hat{X}_{ML}} = 0 \tag{2.18}$$

Consider the following example to demonstrate this relationship between MAP 
and ML. 


Example 2.2 

Consider estimating an unknown constant from a noisy measurement as in the
previous example. Further assume that the noise is an independent Gaussian random
variable such that $v \sim \mathcal{N}(0, R_{vv})$ and the measurement model is given by

$$y = X + v \tag{2.19}$$

First, we assume no a priori statistical knowledge of X, just that it is an unknown,
nonrandom constant. Thus, we require the maximum likelihood estimate, since no
prior information is assumed about Pr(X). The associated conditional density is

$$\Pr(Y|X) = \frac{1}{\sqrt{2\pi R_{vv}}} \exp\left\{-\frac{1}{2}\frac{(y - X)^2}{R_{vv}}\right\} \tag{2.20}$$

The maximum likelihood estimate of X is found by solving the log-likelihood
equation:

$$\nabla_X \ln \Pr(Y|X)\big|_{X=\hat{X}_{ML}} = 0 \tag{2.21}$$

or

$$\frac{\partial}{\partial X}\ln \Pr(Y|X) = \frac{\partial}{\partial X}\left\{-\frac{1}{2}\ln 2\pi R_{vv} - \frac{1}{2R_{vv}}(y - X)^2\right\} = \frac{1}{R_{vv}}(y - X)$$

Setting this expression to zero and solving for X, we obtain

$$\hat{X}_{ML} = y \tag{2.22}$$

The best estimate of X in a maximum likelihood sense is the raw data y. The
corresponding error variance is easily calculated (as before)

$$R_{\tilde{X}|X} = R_{vv} \tag{2.23}$$

Next we model X as a random variable with Gaussian prior, that is, $X \sim \mathcal{N}(\bar{X}, R_{XX})$
and desire the maximum a posteriori estimate. The MAP equation is

$$J_{MAP} = \nabla_X (\ln \Pr(Y|X) + \ln \Pr(X))$$

$$J_{MAP} = \frac{\partial}{\partial X}\left\{-\frac{1}{2}\ln 2\pi R_{vv} - \frac{1}{2R_{vv}}(y - X)^2 - \frac{1}{2}\ln 2\pi R_{XX} - \frac{1}{2R_{XX}}(X - \bar{X})^2\right\}$$

or

$$J_{MAP} = \frac{1}{R_{vv}}(y - X) - \frac{1}{R_{XX}}(X - \bar{X})$$

Setting this expression to zero and solving for $X = \hat{X}_{MAP}$, we obtain

$$\hat{X}_{MAP} = \frac{\bar{X} + \frac{R_{XX}}{R_{vv}}\, y}{1 + \frac{R_{XX}}{R_{vv}}}$$

It can be shown from Eq. 2.9 that the corresponding error variance is

$$R_{\tilde{X}|X} = \frac{R_{vv}}{1 + \frac{R_{vv}}{R_{XX}}}$$

Examining the results of this example, we see that when the parameter variance is
large ($R_{XX} \gg R_{vv}$), the MAP and ML estimates perform equivalently. However, when
the variance is small, the MAP estimator performs better because the corresponding
error variance is smaller. △△△
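The comparison in Example 2.2 can be checked by simulation. The following sketch is an illustration under assumed values, not a result from the text: it draws X from the Gaussian prior, forms y = X + v, and compares the mean-squared errors of the ML estimate (the raw measurement) and the MAP estimate obtained by setting the MAP equation to zero. The prior mean, the two variances, the seed and the function names are hypothetical.

```python
import random


def map_estimate(y, prior_mean, prior_var, noise_var):
    """Scalar Gaussian MAP estimate from setting the MAP equation to zero."""
    ratio = prior_var / noise_var
    return (prior_mean + ratio * y) / (1.0 + ratio)


def compare_ml_map(prior_mean, prior_var, noise_var, num_trials, seed=1):
    """Return the sample mean-squared errors of the ML and MAP estimates."""
    rng = random.Random(seed)
    ml_sq, map_sq = 0.0, 0.0
    for _ in range(num_trials):
        x = rng.gauss(prior_mean, prior_var ** 0.5)      # X drawn from the prior
        y = x + rng.gauss(0.0, noise_var ** 0.5)         # y = X + v
        ml_sq += (x - y) ** 2                            # ML estimate is y itself
        map_sq += (x - map_estimate(y, prior_mean, prior_var, noise_var)) ** 2
    return ml_sq / num_trials, map_sq / num_trials


mse_ml, mse_map = compare_ml_map(prior_mean=5.0, prior_var=1.0,
                                 noise_var=4.0, num_trials=20000)
predicted_map = 4.0 / (1.0 + 4.0 / 1.0)                  # R_vv / (1 + R_vv / R_XX)
print(mse_ml, mse_map, predicted_map)
```

With a prior variance much smaller than the noise variance, as assumed here, the MAP error should be well below the ML error; letting the prior variance grow makes `map_estimate` approach the raw measurement, matching the limiting behavior described in the example.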

The main point to note is that the MAP estimate provides a mechanism to incorporate
the a priori information, while the ML estimate does not. Therefore, for some
problems, MAP is the efficient estimator. In the above example, if X were actually
Gaussian, then the ML solution, which models X as an unknown parameter, is not
an efficient estimator, while the MAP solution that incorporates this information by
using the prior Pr(X) is efficient.

This completes the introduction to batch Bayesian estimation using the maximum 
a posteriori and maximum likelihood estimation. Next we consider a very popular 
approach to solving maximum likelihood estimation problems. 

2.3.1 Expectation-Maximization Approach to Maximum Likelihood

Solving maximum likelihood parameter estimation problems is a formidable task
especially when the underlying probability distribution functions are unknown.
Therefore, we must resort to numerical approaches that will successfully converge to the
parameters of interest. Expectation-Maximization (EM) is a general method of
determining the maximum likelihood estimate of parameters of the underlying distribution
from a given set of data which is incomplete, that is, having “missing” (data) values
[7]. Missing data could be considered a misnomer; however, if we include “hidden
variables” (not directly measured) as missing, then a wide variety of state/parameter
estimation problems can be incorporated into these problem classes. Probably the
most popular applications of the EM technique occur in tomographic image
reconstruction, pattern recognition, communications and the training of hidden Markov
models (see Chapter 9) for speech recognition [8, 9].

The EM technique produces maximum-likelihood parameter estimates in two 
steps: an expectation-step followed by a maximization-step. The expectation-step 
with respect to the unknown parameters uses the most recently available parameter 
estimate conditioned on the measurements, while the maximization-step provides an 
updated estimate of the parameters. 

To be more precise, we mathematically formulate the general “missing data”
problem by first defining the unknown parameters or variables to be estimated as
$\theta \in R^{N_\theta \times 1}$ with $\theta \in \Theta$, the parameter space. We further define three distinct spaces
for our problem: (1) the complete data space, $\mathcal{Z}$; (2) the incomplete data space, $\mathcal{Y}$; and
(3) the missing data space, $\mathcal{X}$, where the complete space is the union: $\mathcal{Z} = (\mathcal{X}, \mathcal{Y})$.
Analogously, we define the corresponding complete, incomplete and missing/hidden
vectors as: $z \in R^{N_z \times 1}$, $y \in R^{N_y \times 1}$ and $x \in R^{N_x \times 1}$, respectively.

With this in mind, we can now define the underlying distributions as the joint or
complete distribution along with its Bayesian decompositions as

$$\Pr(z|\theta) = \Pr(x, y|\theta) = \Pr(x|y, \theta) \times \Pr(y|\theta) = \Pr(y|x, \theta) \times \Pr(x|\theta) \tag{2.24}$$

or taking logarithms, we have the complete (data) log-likelihood

$$\Lambda_{CD}(z|\theta) = \ln \Pr(z|\theta) = \ln \Pr(x, y|\theta) = \ln \Pr(x|y, \theta) + \Lambda_{ID}(y|\theta) \tag{2.25}$$

where $\Lambda_{ID}(y|\theta) = \ln \Pr(y|\theta)$ is the corresponding incomplete (data) log-likelihood
and the missing data is random with $x \sim \Pr(x)$. Since x is random, then so is $\Lambda_{CD}(z|\theta)$
with y and $\theta$ assumed fixed (constant). Thus, the basic maximum likelihood
parameter estimation problem for the complete data is to find the value of the parameter
such that

$$\hat{\theta}_i = \arg\max_{\theta} \Lambda_{CD}(z|\theta)$$

given the measured or observed (incomplete) data, y, and the previous parameter
estimate, $\theta = \hat{\theta}$. However, since $\Lambda_{CD}$ is random, we must search for the parameter
vector that maximizes its expectation (over x), that is, we seek

$$\hat{\theta}_i = \arg\max_{\theta} E_x\{\Lambda_{CD}(z|\theta)\} \tag{2.26}$$


given the measured data, y, and the current parameter estimate, $\hat{\theta}$. Multiplying both
sides of Eq. 2.25 by the underlying marginal posterior distribution of the missing
data, $\Pr(x|y, \theta)$, and summing over x, we obtain

$$\sum_x \Lambda_{CD}(z|\theta) \times \Pr(x|y, \theta) = \sum_x \ln \Pr(z|\theta) \times \Pr(x|y, \theta)$$

$$= \sum_x \ln \Pr(x|y, \theta) \times \Pr(x|y, \theta) + \sum_x \Lambda_{ID}(y|\theta) \times \Pr(x|y, \theta)$$

Using the definition of the conditional expectation and recognizing that the last term
is not a function of the random vector x, we obtain

$$E_x\{\Lambda_{CD}(z|\theta)|y, \theta\} = E_x\{\ln \Pr(x|y, \theta)\} + \Lambda_{ID}(y|\theta) \tag{2.27}$$

Since we do not know the complete data, we cannot calculate the exact
log-likelihood for this problem. But, given the measured data y, we can estimate the
posterior probability for the missing (data) variables, x. For each x, there exists a $\theta$,
and therefore we can calculate an expected value of the complete log-likelihood.

The basic EM principle is to find the $\theta$ that maximizes $\Pr(z|\theta)$ using the available
data y and current parameter estimate. Let $\hat{\theta}_{i-1}$ be the current parameter estimate,
then the complete log-likelihood is given by the expectation-step:

$$\text{E-step:} \quad Q(\theta, \hat{\theta}_{i-1}) := E_x\{\Lambda_{CD}(z|\theta)|y, \hat{\theta}_{i-1}\} \tag{2.28}$$

for $\theta$ the new parameter vector to be optimized in the next step. In this
expression the y and $\theta$ are assumed fixed (constant) and x is the random vector such that
$x \sim \Pr(x|y, \hat{\theta}_{i-1})$ so that

$$E_x\{\Lambda_{CD}(z|\theta)|y, \hat{\theta}_{i-1}\} = \sum_x \ln \Pr(x, y|\theta)\, \Pr(x|y, \hat{\theta}_{i-1}) \tag{2.29}$$

where $\Pr(x|y, \hat{\theta}_{i-1})$ is the marginal of the missing data (hidden variable) based on the
measured data and current parameter estimate (as shown).

The maximization-step is to find the parameter vector update, $\hat{\theta}_i$, that will maximize
the computed expectation, that is,

$$\text{M-step:} \quad \hat{\theta}_i = \arg\max_{\theta} Q(\theta, \hat{\theta}_{i-1}) \tag{2.30}$$

Each iteration is guaranteed not to decrease the log-likelihood, eventually
converging to a local maximum at an exponential rate [10, 11]. Different forms of the
EM have evolved with the “generalized” EM (GEM) method by finding an
alternative (simpler) expectation function and using the updated parameter vector such
that $Q(\hat{\theta}_i, \hat{\theta}_{i-1}) > Q(\theta, \hat{\theta}_{i-1})$ which is also guaranteed to converge. The Monte Carlo
EM (MCEM) is another form that is used when no closed form distribution is available
for the E-step, which is then replaced by simulation-based methods (sampling) [12].
Consider the following example from Snyder [16].

Example 2.3 

Suppose we would like to estimate the rate or intensity parameter, $\lambda_s$, of a signal
contaminated in Poisson noise with known mean, $\lambda_v$. We measure the counts from a
photodetector characterized by

$$y_n = s_n + v_n$$

where $y_n$ is the observation with Poisson distribution

$$y_n \sim \mathcal{P}(\lambda_y) = (\lambda_y)^{y_n} e^{-\lambda_y} / y_n!$$

The respective signal and noise counts during the measurement period are independent
of each other with $s_n \sim \mathcal{P}(\lambda_s)$ and $v_n \sim \mathcal{P}(\lambda_v)$. We have that the log-likelihood of the
incomplete data is

$$\Lambda_{ID}(y_n|\lambda_y) = \ln \mathcal{P}(y_n|\lambda_y) = y_n \ln \lambda_y - \lambda_y - \ln y_n!$$

Differentiating $\Lambda_{ID}$ with respect to $\lambda_s$, setting the result to zero and solving for
the rate parameter, that is,

$$\frac{d}{d\lambda_s}\left(y_n \ln(\lambda_s + \lambda_v) - (\lambda_s + \lambda_v) - \ln y_n!\right) = \frac{y_n}{\lambda_s + \lambda_v} - 1 = 0$$

which yields the maximum likelihood parameter estimate

$$\hat{\lambda}_s = y_n - \lambda_v > 0$$

Here we have used the fact that the superposition of Poisson processes is Poisson
with intensity parameter, $\lambda_y = \lambda_s + \lambda_v$. Next, let us investigate the EM approach to
this problem. Here the complete data space is $z_n = (s_n, v_n)$ and $y_n$ is the incomplete
(observed) data. Therefore, we have that the complete log-likelihood is

$$\Lambda_{CD}(z_n|\lambda_s) = -(\lambda_s + \lambda_v) + s_n \ln \lambda_s + v_n \ln \lambda_v - \ln s_n! - \ln v_n!$$

because $s_n$ and $v_n$ are independent. Thus, the expectation-step is given by

$$\text{E-step:} \quad Q(\lambda_s|\hat{\lambda}_s(i-1)) = -\lambda_s + \hat{s}_n(i-1) \ln \lambda_s - \lambda_v + \hat{v}_n(i-1) \ln \lambda_v$$

with

$$\hat{s}_n(i-1) = E\{s_n|y_n, \hat{\lambda}_s(i-1)\} = \frac{\hat{\lambda}_s(i-1)}{\hat{\lambda}_s(i-1) + \lambda_v}\, y_n$$

and

$$\hat{v}_n(i-1) = E\{v_n|y_n, \hat{\lambda}_s(i-1)\} = y_n - \hat{s}_n(i-1)$$

Since the maximization-step does not depend on $\lambda_v$ or $\hat{v}_n(i-1)$, we have

$$\hat{\lambda}_s(i) = \arg\max_{\lambda_s}\left(-\lambda_s + \hat{s}_n(i-1) \ln \lambda_s\right)$$

giving

$$\text{M-step:} \quad \hat{\lambda}_s(i) = \hat{s}_n(i-1)$$

which completes the EM algorithm.

We simulated a sequence of Poisson counts for 500 samples composed of the
additive signal ($\lambda_s = 14$) and noise ($\lambda_v = 3.5$). The estimated signal intensity at
each measurement is shown in Fig. 2.1a along with the estimated probability mass
functions in Fig. 2.1b of the true signal, EM estimated signal, measurement and noise.
Here we see the estimated signal PDF closely matches the true PDF and the average
intensity is very close to the true signal intensity at $\hat{\lambda}_s = 14.02$. △△△

FIGURE 2.1 EM estimation for Poisson Intensity: (a) Signal intensity estimate (Mean =
14.02). (b) PDF estimation of Poisson processes: Noise, Signal, EM Estimate, Measurement.
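The E- and M-steps of Example 2.3 reduce to a one-line recursion, which the sketch below iterates for each measurement. This is a hedged illustration rather than the simulation used for Fig. 2.1: the Poisson sampler (Knuth's multiplication method), the seed, the starting value and the iteration count are assumptions introduced here, and the function names are hypothetical.

```python
import math
import random


def em_poisson_signal_rate(y_n, noise_rate, initial_rate=1.0, num_iters=200):
    """Iterate the E-step (expected signal count) and M-step (rate update)."""
    rate = initial_rate
    for _ in range(num_iters):
        s_hat = rate / (rate + noise_rate) * y_n   # E-step: E{s_n | y_n, rate}
        rate = s_hat                               # M-step: new rate equals s_hat
    return rate


def sample_poisson(rate, rng):
    """Draw a Poisson count with Knuth's method (adequate for modest rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


rng = random.Random(2)
signal_rate, noise_rate = 14.0, 3.5                # values quoted in the example
estimates = []
for _ in range(500):
    y_n = sample_poisson(signal_rate, rng) + sample_poisson(noise_rate, rng)
    estimates.append(em_poisson_signal_rate(y_n, noise_rate))
mean_estimate = sum(estimates) / len(estimates)
print(mean_estimate)
```

For a single count the recursion settles at y_n minus the noise rate when that difference is positive and at zero otherwise, which agrees with the direct maximum likelihood solution derived in the example; averaging the per-measurement estimates should give a value near the assumed signal rate.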

In summary, the EM approach is a two-step, iterative technique to solve the
maximum likelihood parameter estimation problem when missing data or hidden variables
(states) are present. Clearly, we have just discussed the generic form of the EM
approach; the actual algorithm is problem specific based on the knowledge of the
underlying distributions. However, we can consider a class of distributions to develop
a more specific algorithm such as the so-called exponential family [13].

2.3.2 EM for Exponential Family of Distributions 

In this subsection we apply the EM technique to the exponential class of
distribution functions, many of which are well-known forms like the Gaussian, Poisson,
Exponential, Rayleigh, Binomial and more. The exponential family is defined by the
generic form:

$$\Pr(z|\theta) = b(z) \exp(c'(\theta)s(z)) / a(\theta) \tag{2.31}$$

with $\theta \in R^{N_\theta}$, the parameters defining the class [7, 14, 15] and s(z) the sufficient
statistic providing all of the information necessary about the underlying process for
estimation with $s, c \in R^{N_s \times 1}$. Since the complete log-likelihood can be written as

$$\Lambda_{CD}(z|\theta) = \ln \Pr(z|\theta) = \ln b(z) + c'(\theta)s(z) - \ln a(\theta) \tag{2.32}$$

then taking the conditional expectations, we obtain the expectation-step as:

$$\text{E-step:} \quad Q(\theta, \hat{\theta}_{i-1}) = E_x\{\ln b(z)|y, \hat{\theta}_{i-1}\} + c'(\theta)E\{s(z)|y, \hat{\theta}_{i-1}\} - \ln a(\theta) \tag{2.33}$$

Defining $\hat{s}_i := E\{s(z)|y, \hat{\theta}_{i-1}\}$, the maximization-step with respect to $\theta$ is
performed on

$$E\{\ln b(z)|y, \hat{\theta}_{i-1}\} + c'(\theta)\hat{s}_i - \ln a(\theta) \tag{2.34}$$

Since the first term is not a function of $\theta$, it need not be included. The EM technique
for the exponential family is:

$$\text{E-step:} \quad \hat{s}_i := E\{s(z)|y, \hat{\theta}_{i-1}\}$$

$$\text{M-step:} \quad \hat{\theta}_i = \arg\max_{\theta}\; c'(\theta)\hat{s}_i - \ln a(\theta) \tag{2.35}$$

completing the method.

For instance, in the previous example, we have $b(z) = (\lambda_y)^{y_n}$; $\exp(c's) = e^{-\lambda_y}$ and
$a(\theta) = y_n!$. Consider the following example of producing an image by estimating the
intensity of a Poisson process discussed in McLachlan [6].
Example 2.4

A radioactive tracer is injected into tissue, and the photons it emits are used to create
a single photon emission computed tomography (SPECT) image for medical diagnosis
[6, 14-16]. The basic idea is to create an image that represents the photons emitted,
counted by photon detectors and displayed as an image based on pixel counts. For
simplicity we assume all photons are counted by the instrument. The emissions are
assumed to be Poisson distributed with unknown intensities or emission densities
corresponding to the Poisson rate parameters, $\lambda$. Thus, the basic problem is:

GIVEN a set of incomplete measurements (counts), $\{y_n\}$, FIND the maximum
likelihood estimate of the unknown emission densities, $\lambda$ (rate parameter vector), assuming
that each of these component intensities is constant within a pixel, that is, $\lambda_m$ is
constant for $m = 1, \ldots, M_p$.

Let $y_n$; $n = 1, \ldots, N_d$ be the number of counts measured by the $n^{th}$-detector, so
mathematically the problem is to estimate the $M_p$-intensity vector of emission
densities, $\lambda$, from the $N_d$-measurement vector, y, assuming that the counts are conditionally
independent such that $\Pr(y_n|\lambda_y(n))$ is Poisson, that is,

$$y_n \sim \mathcal{P}(\lambda_y(n)) = \frac{(\lambda_y(n))^{y_n} e^{-\lambda_y(n)}}{y_n!}$$


For imaging considerations, it is assumed that individual photons are emitted
within a given pixel and detected; therefore, we define $x_{mn}$ as the unobservable counts
(missing data) assuming $\lambda$ is known and conditionally independent of x, such that,

$$x_{mn} \sim \mathcal{P}(\lambda_x(m, n)) = \frac{(\lambda_x(m, n))^{x_{mn}} e^{-\lambda_x(m, n)}}{x_{mn}!}$$

where $\lambda_x(m, n) = \lambda_m p_{mn}$, the emission density corresponding to the number of photons
emitted in the $m^{th}$-pixel detected by the $n^{th}$-detector with emission probability, $p_{mn}$, a
marked Poisson process [16, 18].

The photon counter measurement at the $n^{th}$-detector is the superposition of the
photons emitted by the radionuclide at the $m^{th}$-pixel; therefore, we have that

$$y_n = \sum_{m=1}^{M_p} x_{mn}$$

which is the sum of Poisson variates, so that

$$\lambda_y(n) = \sum_{m=1}^{M_p} \lambda_x(m, n) = \sum_{m=1}^{M_p} \lambda_m p_{mn}$$
It has been shown [6] that the conditional probability of the missing data is binomial
with parameter, $y_n$ and probability $\Pr(\lambda_m p_{mn})$, that is,

$$x_{mn} \sim \mathcal{B}(y_n, \Pr(\lambda_m p_{mn})) \quad \text{with} \quad \Pr(\lambda_m p_{mn}) = \lambda_m p_{mn} \Big/ \sum_{j=1}^{M_p} \lambda_j p_{jn}$$

The EM approach for this problem is straightforward, since the underlying
distributions are all in the exponential family [6]. The sufficient statistic is the data, $x_{mn}$,
which can be estimated. That is, since $x_{mn}$ is binomial, the conditional mean (Q-step)
is given by

$$\hat{x}_{mn}(i-1) := E\{x_{mn}|y_n, \hat{\lambda}_m(i-1)\} = y_n \left(\lambda_x(m, n) \Big/ \sum_{j=1}^{M_p} \lambda_x(j, n)\right)$$

where $\lambda_x(m, n) = \hat{\lambda}_m(i-1)p_{mn}$. Next define the likelihood based on z = (x, y)
assuming $\lambda$ is known and conditionally independent. Therefore, the complete likelihood is
given by

$$\mathcal{L}(x, y|\lambda) = \prod_{m,n} e^{-\lambda_x(m, n)} (\lambda_x(m, n))^{x_{mn}} / x_{mn}!$$

and

$$\Lambda(x, y|\lambda) = \sum_{m,n} -\lambda_x(m, n) + x_{mn} \ln \lambda_x(m, n) - \ln x_{mn}!$$

or substituting for $\lambda_x(m, n)$ and expanding, we have

$$\Lambda(x, y|\lambda) = \sum_{m=1}^{M_p} \sum_{n=1}^{N_d} -\lambda_m p_{mn} + x_{mn} \ln(\lambda_m p_{mn}) - \ln x_{mn}!$$

Differentiating this expression with respect to $\lambda_m$, setting the result to zero and
solving gives the maximization-step as:

$$\hat{\lambda}_m(i) = \sum_{n=1}^{N_d} \hat{x}_{mn}(i-1)$$


Summarizing, the E and M-steps for this problem are given by:

$$\text{E-step:} \quad Q(\lambda_m, \hat{\lambda}_m(i-1)) = E_x\{\Lambda(x, y)|y, \hat{\lambda}_m(i-1)\} = y_n \hat{\lambda}_m(i-1) p_{mn} \Big/ \sum_{j=1}^{M_p} \hat{\lambda}_j(i-1) p_{jn}$$

$$\text{M-step:} \quad \hat{\lambda}_m(i) = \hat{\lambda}_m(i-1) \sum_{n=1}^{N_d} \left(\frac{y_n p_{mn}}{\sum_{j=1}^{M_p} \hat{\lambda}_j(i-1) p_{jn}}\right)$$

More details can be found in [6, 14, 16]. △△△
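The combined E- and M-step of Example 2.4 is a multiplicative update on the emission densities, sketched below for a toy geometry. This is an illustrative sketch only: the two-pixel, three-detector probability table, the true densities and the use of noise-free expected counts as data are assumptions chosen so the result can be checked, and each row of the table is taken to sum to one, in line with the simplifying assumption that all photons are counted.

```python
def ml_em_emission(counts, detect_prob, num_iters=500, initial=1.0):
    """EM iteration for emission densities.

    counts[n] holds the detector counts y_n; detect_prob[m][n] holds p_mn,
    with every row assumed to sum to one.
    """
    num_pixels = len(detect_prob)
    num_detectors = len(counts)
    rates = [initial] * num_pixels
    for _ in range(num_iters):
        # lambda_y(n) = sum_j lambda_j p_jn under the current estimate
        predicted = [sum(rates[j] * detect_prob[j][n] for j in range(num_pixels))
                     for n in range(num_detectors)]
        # combined E- and M-step: multiplicative update of each pixel density
        rates = [rates[m] * sum(counts[n] * detect_prob[m][n] / predicted[n]
                                for n in range(num_detectors))
                 for m in range(num_pixels)]
    return rates


detect_prob = [[0.7, 0.2, 0.1],      # hypothetical p_mn for pixel 1
               [0.1, 0.3, 0.6]]      # hypothetical p_mn for pixel 2
true_rates = [10.0, 4.0]
# Noise-free expected counts, used here only as a consistency check
counts = [sum(true_rates[m] * detect_prob[m][n] for m in range(2)) for n in range(3)]
estimated = ml_em_emission(counts, detect_prob)
print(estimated)
```

Because the update multiplies the previous estimate by a positive factor, the densities stay nonnegative without any explicit constraint, and with rows summing to one the total estimated intensity matches the total count at every iteration.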

This completes our discussion of the EM approach to parameter estimation
problems with incomplete data/parameters. Next we briefly describe one of the
most popular approaches to the estimation problem: minimum variance (MV) or
equivalently minimum mean-squared error (MMSE) estimation.


2.4 BATCH MINIMUM VARIANCE ESTIMATION 

To complete this discussion evolving from the batch Bayesian perspective, we discuss
the development of the minimum (error) variance (MV) or equivalently minimum
mean-squared error (MMSE) estimator. The general techniques developed in this
section can be applied to develop various model-based processors [1].

Suppose we are asked to obtain an estimate of an $N_x$-parameter vector X from a
set of noisy measurements characterized by the $N_y$-measurement vector Y. We would
like to construct an estimate that is best or optimal in some sense. The most natural
criterion to consider is one that minimizes the error between the true parameter and
its estimate based on the measured data. The error variance criterion is defined by

$$J(\hat{X}) = E_X\{[X - \hat{X}(Y)]'[X - \hat{X}(Y)]\,|\,Y\} \tag{2.36}$$

where

X is the true random $N_x$-vector
Y is the measured random $N_y$-vector (data) and
$\hat{X}$ is the estimate of X given Y

Its minimization leads to the minimum variance estimator [1]. Thus, if we
minimize $J(\hat{X})$ using the chain rule of Eq. 2.11, then we have that

$$\nabla_{\hat{X}} J(\hat{X}) = E_X\{\nabla_{\hat{X}}(X - \hat{X}(Y))'(X - \hat{X}(Y))|Y\}$$

$$= E_X\{-(X - \hat{X}(Y)) - (X - \hat{X}(Y))|Y\}$$

$$= -2[E_X\{X - \hat{X}(Y)|Y\}]$$

performing the conditional expectation operation gives

$$\nabla_{\hat{X}} J(\hat{X}) = -2[E_X\{X|Y\} - \hat{X}(Y)] \tag{2.37}$$

and setting this equation to zero and solving yields the minimum variance estimate as

$$\hat{X}_{MV} = \hat{X}(Y) = E_X\{X|Y\} \tag{2.38}$$
We check for the global minimum by examining the sign (positive) of the second
derivative, and since

$$\nabla_{\hat{X}}(\nabla_{\hat{X}} J(\hat{X}))' = 2I \tag{2.39}$$

a unique minimum exists. Thus, the minimum variance estimator is the conditional
mean. The associated error variance is determined by substituting for $\hat{X}(Y)$ into
Eq. 2.36 giving

$$J_{MV} = R_{\tilde{X}|Y} = E_X\{(X - E\{X|Y\})'(X - E\{X|Y\})|Y\}$$

$$= E_X\{X'X|Y\} - E_X^2\{X|Y\} \tag{2.40}$$
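A small numerical check can make Eqs. 2.38 and 2.40 tangible in the scalar case. The sketch below assumes a made-up discrete conditional distribution Pr(X|Y = y) (the values and probabilities are hypothetical), evaluates the conditional mean-squared error for a grid of candidate estimates, and compares the minimizer with the conditional mean.

```python
def conditional_mse(estimate, values, probs):
    """J(estimate) = E{(X - estimate)^2 | Y} for a discrete conditional law."""
    return sum(p * (x - estimate) ** 2 for x, p in zip(values, probs))


values = [0.0, 1.0, 2.0, 5.0]        # hypothetical support of X given Y = y
probs = [0.1, 0.4, 0.3, 0.2]         # hypothetical Pr(X | Y = y)

conditional_mean = sum(p * x for x, p in zip(values, probs))
candidates = [-1.0 + 0.01 * k for k in range(701)]
best = min(candidates, key=lambda c: conditional_mse(c, values, probs))

j_mv = conditional_mse(conditional_mean, values, probs)
second_moment = sum(p * x * x for x, p in zip(values, probs))
print(best, conditional_mean, j_mv, second_moment - conditional_mean ** 2)
```

The grid minimizer should coincide with the conditional mean to within the grid spacing, and the minimum cost should equal the conditional second moment minus the squared conditional mean, which is the scalar form of Eq. 2.40.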

This estimator is linear, unconditionally and conditionally unbiased and
possesses general orthogonality properties. To see this we investigate the estimation
error defined by

$$\tilde{X} = X - \hat{X}(Y) \tag{2.41}$$

Taking expectations, we have the fact that the estimator is unconditionally unbiased

$$E_X\{\tilde{X}\} = E_X\{X - \hat{X}(Y)\} = E_X\{X\} - E_X\{E_X\{X|Y\}\} = E_X\{X\} - E_X\{X\} = 0 \tag{2.42}$$

as well as conditionally unbiased, since

$$E_X\{\tilde{X}|Y\} = E_X\{X - \hat{X}(Y)|Y\} = E_X\{X|Y\} - E_X\{E_X\{X|Y\}|Y\} = E_X\{X|Y\} - E_X\{X|Y\} = 0 \tag{2.43}$$

Another important property of the minimum variance estimator is that the estimation
error is orthogonal to any function, say $f(\cdot)$, of the data vector Y [4], that is,

$$E_{XY}\{f(Y)\tilde{X}'\} = 0 \tag{2.44}$$

Also,

$$E_X\{f(Y)\tilde{X}'|Y\} = 0$$

This is the well-known orthogonality condition. To see this we substitute for the error

$$E_X\{f(Y)\tilde{X}'|Y\} = E_X\{f(Y)(X - \hat{X}(Y))'|Y\}$$

$$= f(Y)E_X\{(X - \hat{X}(Y))'|Y\}$$

$$= f(Y)(E_X\{X|Y\} - \hat{X}(Y))' = 0 \tag{2.45}$$

Taking the expectation over Y proves the unconditional result as well. Thus, the
estimation error is orthogonal to any function of Y, a fact that is used in proofs
throughout the literature for both linear and nonlinear estimators.
Let us now investigate the special case of the linear minimum variance estimator.
The estimation error is orthogonal to all past data Y, that is,

$$E_{XY}\{Y\tilde{X}'\} = 0 \tag{2.46}$$

or as before

$$E_X\{Y\tilde{X}'|Y\} = 0 \tag{2.47}$$

This is the well-known minimum variance estimator result in the linear case [1]. For
a linear function of the parameter, we have that

$$y = CX + v \tag{2.48}$$

where $y, v \in R^{N_y \times 1}$, $X \in R^{N_x \times 1}$, $C \in R^{N_y \times N_x}$, and v is zero-mean, white with $R_{vv}$. The
mean-squared error criterion

$$J(\hat{X}) = E\{\tilde{X}'\tilde{X}\} \tag{2.49}$$

is minimized to determine the estimate. The minimization results in the orthogonality
condition of Eq. 2.47 which we write as

$$E\{y\tilde{X}'\} = E\{yX'\} - E\{y\hat{X}'_{MV}\} = 0 \tag{2.50}$$

for $\hat{X}_{MV} = K_{MV}\,y$, a linear function of the data vector. Substituting for y and $\hat{X}$ in this
equation gives

$$K_{MV} = R_{XX}C'(CR_{XX}C' + R_{vv})^{-1} \tag{2.51}$$

where $R_{XX}$ is the covariance matrix of X. The corresponding quality is obtained as

$$R_{\tilde{X}\tilde{X}} = (R_{XX}^{-1} + C'R_{vv}^{-1}C)^{-1} \tag{2.52}$$

It is also interesting to note that the fundamental Wiener result [1] is easily obtained
from the orthogonality condition of Eq. 2.47, that is,

$$E\{y\tilde{X}'\} = E\{yX'\} - E\{yy'\}K'_{MV} = R_{yX} - R_{yy}K'_{MV} = 0 \tag{2.53}$$

which is called the discrete Wiener-Hopf equation. Solving for $K_{MV}$, we obtain the
Wiener solution for a linear (batch) estimation scheme, that is,

$$K_{MV} = R_{Xy}R_{yy}^{-1} \tag{2.54}$$
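The agreement between the gain of Eq. 2.51 and the Wiener solution of Eq. 2.54 can be checked in the scalar case. The sketch below is an illustration under assumptions not in the text: X is taken as zero-mean Gaussian, C is a scalar, the covariances in Eq. 2.54 are replaced by sample averages from simulated data, and all numerical values, the seed and the function names are hypothetical.

```python
import random


def linear_mv_gain(prior_var, c, noise_var):
    """Scalar form of Eq. 2.51: K = R_XX C (C R_XX C + R_vv)^-1."""
    return prior_var * c / (c * prior_var * c + noise_var)


def error_variance(prior_var, c, noise_var):
    """Scalar form of Eq. 2.52: (1/R_XX + C^2/R_vv)^-1."""
    return 1.0 / (1.0 / prior_var + c * c / noise_var)


def sample_wiener_gain(prior_var, c, noise_var, num_trials, seed=3):
    """Estimate K = R_Xy / R_yy (Eq. 2.54) from simulated zero-mean data."""
    rng = random.Random(seed)
    sum_xy, sum_yy = 0.0, 0.0
    for _ in range(num_trials):
        x = rng.gauss(0.0, prior_var ** 0.5)
        y = c * x + rng.gauss(0.0, noise_var ** 0.5)    # y = CX + v
        sum_xy += x * y
        sum_yy += y * y
    return sum_xy / sum_yy


k_mv = linear_mv_gain(2.0, 3.0, 1.0)
k_wiener = sample_wiener_gain(2.0, 3.0, 1.0, 20000)
print(k_mv, k_wiener, error_variance(2.0, 3.0, 1.0))
```

In the scalar case the error variance of Eq. 2.52 can also be written as the prior variance minus the product of gain, C and prior variance, which gives a quick algebraic cross-check of the two expressions.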

Note that least-squares estimation is similar to that of minimum variance except that
no statistical information (expectation removed) is assumed known about the process,
that is, the least-squares estimator, $\hat{y}_{LS}$, minimizes the sum-squared error criterion

$$\min J = \tilde{y}'\tilde{y} = \sum_i \tilde{y}_i^2 \tag{2.55}$$

for $\tilde{y} = y - \hat{y}_{LS}$.
This completes the introduction to batch minimum variance, maximum a posteriori
and maximum likelihood estimation. Next we consider the sequential problem. 


2.5 SEQUENTIAL BAYESIAN ESTIMATION 

Modern statistical signal processing techniques evolve directly from a Bayesian
perspective, that is, they are cast into a probabilistic framework using Bayes’ theorem
as the fundamental construct. More specifically, the information about the random
signal, x(t), required to solve a vast majority of estimation/processing problems is
incorporated in the underlying probability distribution generating the process. For
instance, the usual signal enhancement problem is concerned with providing the
“best” (in some sense) estimate of the signal at time t based on all of the data
available at that time. The filtering distribution provides that information directly in terms
of its underlying statistics. That is, by calculating the statistics of the process directly
from the filtering distribution the enhanced signal can be extracted using a variety of
estimators like MAP, ML or MMSE accompanied by a set of performance statistics
such as error covariances and bounds. Sequential methods to calculate these
distributions become extremely important in pragmatic problems where implementation and
speed are an issue. Therefore from an engineering perspective, they are our primary
focus.

The roots of this theory are based on Bayesian estimation and in fact sequential
Bayesian estimation. We will see that many of our well-known techniques are
easily cast into this unifying framework especially in the nonlinear signal processing
area. The Bayesian algorithms that provide posterior distribution estimates are
optimal; however, they are impossible to implement directly because they require
integrations or summations with an infinite number of terms. We will develop
the optimal Bayesian algorithms mathematically and perform the required
calculations in a sequential manner, but it must be realized that only under certain
restricted circumstances can these actually be realized (e.g., the linear Gaussian
case). Starting with Bayes’ theorem it is possible to show how this leads directly
to a recursive or sequential estimation framework that is the foundation of these new
approaches.

We cast this discussion into a dynamic variable/parameter structure by defining
the “unobserved” signal or equivalently “hidden” variables as the set of $N_x$-vectors,
$\{x(t)\}$, $t = 0, \ldots, N$. On the other hand, we define the observables or equivalently
measurements as the set of $N_y$-vectors, $\{y(t)\}$, $t = 0, \ldots, N$ considered to be
conditionally independent of the signal variables. The goal in recursive Bayesian
estimation is to sequentially (in-time) estimate the joint posterior distribution,
$\Pr(x(0), \ldots, x(N); y(0), \ldots, y(N))$. Once the posterior is estimated then many of the
interesting statistics characterizing the process under investigation can be exploited
to extract meaningful information.

We start by defining two sets of random (vector) processes: $X_t := \{x(0), \ldots, x(t)\}$
and $Y_t := \{y(0), \ldots, y(t)\}$, as before. Here we can consider $X_t$ to be the set of dynamic
random variables or parameters of interest and $Y_t$ as the set of measurements or
observations of the desired process as before.² In any case we start with Bayes’
theorem for the joint posterior distribution as

$$\Pr(X_t|Y_t) = \Pr(Y_t|X_t) \times \frac{\Pr(X_t)}{\Pr(Y_t)} \tag{2.56}$$

In Bayesian theory (as before), the posterior defined by $\Pr(X_t|Y_t)$ is decomposed
in terms of the prior $\Pr(X_t)$, its likelihood $\Pr(Y_t|X_t)$ and the evidence or normalizing
factor, $\Pr(Y_t)$. Each has a particular significance in this construct which we shall
discuss subsequently.

We can replace the evidence by realizing that it is the total probability given by 

$\Pr(Y_t) = \int \Pr(Y_t \mid X_t) \, \Pr(X_t) \, dX_t$    (2.57)

which is integrated over the entire parameter space of $X_t$. Substituting this
into Eq. 2.56 we obtain


$\Pr(X_t \mid Y_t) = \dfrac{\Pr(Y_t \mid X_t) \times \Pr(X_t)}{\int \Pr(Y_t \mid X_t) \, \Pr(X_t) \, dX_t}$    (2.58)


Once the posterior distribution is determined, then all statistical inferences or
estimates are made by integration or equivalently summation. For example, suppose
we would like to obtain a prediction distribution, then it is obtained as

$\Pr(X_{t+1} \mid Y_t) = \int \Pr(X_{t+1} \mid X_t, Y_t) \times \Pr(X_t \mid Y_t) \, dX_t$

and a point estimate might be the conditional mean of this distribution, that is,

$E\{X_{t+1} \mid Y_t\} = \int X_{t+1} \, \Pr(X_{t+1} \mid Y_t) \, dX_{t+1}$

while the variance of the distribution can be calculated similarly and used for performance
evaluation. This calculation illustrates how information that can be estimated
from the extracted distribution is applied in the estimation context.
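
To make the "inference by integration or summation" idea concrete, the following is a minimal sketch, not a method taken from the text. It assumes a hypothetical scalar problem (prior $x \sim \mathcal{N}(0, 1)$, measurement $y = x + v$ with $v \sim \mathcal{N}(0, 0.25)$) and approximates the integrals of Eq. 2.58 by sums on a grid. The function names and all numerical values are illustrative assumptions.

```python
import numpy as np

def gaussian_pdf(z, mean, var):
    """Scalar Gaussian density evaluated at z."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def grid_posterior(y, grid, prior_var=1.0, noise_var=0.25):
    """Posterior Pr(x|y) on a uniform grid via Bayes' rule (assumed toy model)."""
    dx = grid[1] - grid[0]
    prior = gaussian_pdf(grid, 0.0, prior_var)      # Pr(x)
    likelihood = gaussian_pdf(y, grid, noise_var)   # Pr(y|x)
    unnormalized = likelihood * prior
    evidence = np.sum(unnormalized) * dx            # integral in the denominator
    return unnormalized / evidence

grid = np.linspace(-8.0, 8.0, 4001)
dx = grid[1] - grid[0]
post = grid_posterior(y=1.2, grid=grid)

# Point estimate (conditional mean) and variance obtained by summation.
post_mean = np.sum(grid * post) * dx
post_var = np.sum((grid - post_mean) ** 2 * post) * dx
```

For this linear Gaussian toy case the sums can be checked against the known closed-form conditional mean and variance; for general priors and likelihoods no such check exists, and the grid grows exponentially with the dimension of the parameter space.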

Again, even though simple conceptually, the real problem to be addressed is that
of evaluating these integrals, which is very difficult because they are analytically
tractable only for a small class of priors and likelihood distributions. The large
dimensionality of the integrals causes numerical integration techniques to break down, which
leads to the approximations we discuss subsequently for stabilization. Let us investigate
further to consider posing problems of signal processing interest. First, we
review some of the basics.

² In Kalman filtering theory, the $X_t$ are considered the states or hidden variables not necessarily observable
directly, while the $Y_t$ are observed or measured directly.

Many of the derivations and algorithms to follow are based on the simple, but 
sometimes not so obvious, manipulations of Bayes’ theorem; therefore, we develop 
these variants here first. Starting with a joint distribution of variables indexed by t, 
we apply Bayes’ rule³ in steps to expose some of the manipulations:

$\Pr(y(t), y(t-1), y(t-2)) = \Pr(y(t), y(t-1) \mid y(t-2)) \times \Pr(y(t-2))$    (2.59)

Applying the rule again to the first term in this expression gives

$\Pr(y(t), y(t-1) \mid y(t-2)) = \Pr(y(t) \mid y(t-1), y(t-2)) \times \Pr(y(t-1) \mid y(t-2))$    (2.60)

Combining these results we have

$\Pr(y(t), y(t-1), y(t-2)) = \Pr(y(t) \mid y(t-1), y(t-2)) \times \Pr(y(t-1) \mid y(t-2)) \times \Pr(y(t-2))$    (2.61)

Additionally, if $\{y(t)\}$ is considered first order Markov, then Eq. 2.61 simplifies
even further to

$\Pr(y(t), y(t-1), y(t-2)) = \Pr(y(t) \mid y(t-1)) \times \Pr(y(t-1) \mid y(t-2)) \times \Pr(y(t-2))$    (2.62)

Generalizing these expansions, we can obtain the chain rule of probability which
states that the joint distribution of $Y_t$ can be decomposed using Bayes’ rule as

$\Pr(Y_t) = \Pr(y(t) \mid Y_{t-1}) \times \Pr(Y_{t-1}) = \Pr(y(t) \mid Y_{t-1}) \times \Pr(y(t-1) \mid Y_{t-2}) \times \Pr(Y_{t-2})$

or expanding this expression, we obtain

$\Pr(Y_t) = \prod_{k=0}^{t} \Pr(y(t-k) \mid Y_{t-k-1}) = \Pr(y(t) \mid Y_{t-1}) \times \cdots \times \Pr(y(1) \mid Y_0) \times \Pr(y(0))$    (2.63)

Further assuming that the process is Markov, then

$\Pr(y(t) \mid Y_{t-1}) = \Pr(y(t) \mid y(t-1))$

and the chain rule simplifies even further to

$\Pr(Y_t) = \prod_{k=0}^{t} \Pr(y(t-k) \mid Y_{t-k-1}) = \prod_{k=0}^{t} \Pr(y(t-k) \mid y(t-k-1))$    (2.64)

³ In set notation, we have (simply) that

$\Pr(ABC) = \Pr(AB \mid C) \times \Pr(C) = [\Pr(A \mid BC) \times \Pr(B \mid C)] \times \Pr(C)$

Marginal distributions are used throughout these derivations and they follow 
directly from the Chapman-Kolmogorov evolution equation [18], that is, 

$\Pr(y(t) \mid y(t-2)) = \int \Pr(y(t) \mid y(t-1)) \times \Pr(y(t-1) \mid y(t-2)) \, dy(t-1)$    (2.65)
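
For a discrete-valued Markov process the integral in the Chapman-Kolmogorov relation becomes a sum over the intermediate state. The sketch below illustrates this with a hypothetical three-state transition matrix. The matrix values and the function name `two_step` are assumptions made only for illustration.

```python
import numpy as np

# Hypothetical transition matrix: P[i, j] = Pr(y(t) = j | y(t-1) = i).
P = np.array([[0.9, 0.1, 0.0],
              [0.2, 0.7, 0.1],
              [0.0, 0.3, 0.7]])

def two_step(P):
    """Two-step transition Pr(y(t) | y(t-2)) by summing out y(t-1)."""
    n = P.shape[0]
    P2 = np.zeros((n, n))
    for i in range(n):            # state at t-2
        for j in range(n):        # state at t
            for m in range(n):    # intermediate state at t-1 (marginalized)
                P2[i, j] += P[m, j] * P[i, m]
    return P2

P2 = two_step(P)
```

The explicit triple loop mirrors the structure of the integral. In practice the same result is the matrix product `P @ P`.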

Before we close this section, we must mention that the derivations to follow rely 
on the following two fundamental assumptions: 

• the dynamic variable x(t) is Markov; and 

• the data $y(t)$ are conditionally independent of the past dynamic variables,
$\{x(t-k)\}$ for all $k > 0$.

Next we proceed with the development of generic Bayesian processors. 

2.5.1 Joint Posterior Estimation 

With this information in mind, we develop the sequential Bayesian solution to the joint
posterior estimation problem. Starting with Eq. 2.56, we decompose this expression
by first extracting the $t$-th term from the joint distributions, that is,

$\Pr(Y_t \mid X_t) = \Pr(y(t), Y_{t-1} \mid x(t), X_{t-1})$

and applying Bayes’ rule to obtain

$\Pr(Y_t \mid X_t) = \Pr(y(t) \mid Y_{t-1}, x(t), X_{t-1}) \times \Pr(Y_{t-1} \mid x(t), X_{t-1})$    (2.66)

The data at time $t$, $y(t)$, is assumed independent of $X_{t-1}$ and $Y_{t-1}$; therefore, the
first term of Eq. 2.66 simplifies to

$\Pr(y(t) \mid x(t), Y_{t-1}, X_{t-1}) \rightarrow \Pr(y(t) \mid x(t))$    (2.67)

The second term also simplifies based on the independence of the past data and
$x(t)$ to give

$\Pr(Y_{t-1} \mid x(t), X_{t-1}) \rightarrow \Pr(Y_{t-1} \mid X_{t-1})$    (2.68)

We now have the final expression for the likelihood as

$\Pr(Y_t \mid X_t) = \Pr(y(t) \mid x(t)) \times \Pr(Y_{t-1} \mid X_{t-1})$    (2.69)

Similarly, for the prior, extracting the $t$-th term and applying the rule, we obtain

$\Pr(X_t) = \Pr(x(t), X_{t-1}) = \Pr(x(t) \mid X_{t-1}) \times \Pr(X_{t-1})$

Assuming $x(t)$ is a first order Markov process, this expression simplifies to

$\Pr(X_t) = \Pr(x(t) \mid x(t-1)) \times \Pr(X_{t-1})$

Finally the evidence is obtained in a similar manner:

$\Pr(Y_t) = \Pr(y(t), Y_{t-1}) = \Pr(y(t) \mid Y_{t-1}) \times \Pr(Y_{t-1})$

Therefore, substituting these results into Eq. 2.56, we obtain

$\Pr(X_t \mid Y_t) = \dfrac{[\Pr(Y_{t-1} \mid X_{t-1}) \times \Pr(y(t) \mid x(t))] \, [\Pr(x(t) \mid x(t-1)) \times \Pr(X_{t-1})]}{\Pr(y(t) \mid Y_{t-1}) \times \Pr(Y_{t-1})}$

but the posterior at the previous time, $t-1$, is given by

$\Pr(X_{t-1} \mid Y_{t-1}) = \dfrac{\Pr(Y_{t-1} \mid X_{t-1}) \times \Pr(X_{t-1})}{\Pr(Y_{t-1})}$

Identifying the terms on the right-hand side of Eq. 2.72 and grouping them together
enables the joint sequential Bayesian posterior estimator to be expressed as

$\Pr(X_t \mid Y_t) = \left[ \dfrac{\Pr(y(t) \mid x(t)) \times \Pr(x(t) \mid x(t-1))}{\Pr(y(t) \mid Y_{t-1})} \right] \times \Pr(X_{t-1} \mid Y_{t-1})$

This result is satisfying in the sense that we need only know the joint posterior
distribution at the previous stage, $t-1$, scaled by a weighting function to sequentially
propagate the posterior to the next stage, that is,

$\Pr(X_t \mid Y_t) = W(t, t-1) \times \Pr(X_{t-1} \mid Y_{t-1})$

where the weight is defined by

$W(t, t-1) := \dfrac{\Pr(y(t) \mid x(t)) \times \Pr(x(t) \mid x(t-1))}{\Pr(y(t) \mid Y_{t-1})}$

[Figure 2.2 diagram: $\Pr[X_0 \mid Y_0] \Rightarrow \Pr[X_1 \mid Y_1] \Rightarrow \cdots \Rightarrow \Pr[X_{t-1} \mid Y_{t-1}] \Rightarrow \Pr[X_t \mid Y_t]$]

FIGURE 2.2 Sequential Bayesian processor for joint posterior distribution.

The sequential Bayesian processor is shown diagrammatically in Fig. 2.2. Even
though this expression provides the full joint posterior solution, it is not physically
realizable unless the distributions are known in closed form and the underlying multiple
integrals or sums can be analytically determined. In fact, a more useful solution
is the marginal posterior distribution.

2.5.2 Filtering Posterior Estimation 

In this subsection we develop a more realizable Bayesian processor for the posterior
distribution of the random $x(t)$. Instead of requiring that the posterior possess the
entire set of dynamic variables, $X_t$, we need only restrict our attention to the current
variable at time $t$. That is, for signal enhancement, we would like to estimate $\Pr(x(t) \mid Y_t)$
where we restrict our attention to the value of the parameter based on all of the available
measurements at time $t$. We start with the prediction recursion “marginalizing” the
joint posterior distribution

$\Pr(x(t) \mid Y_{t-1}) = \int \Pr(x(t), x(t-1) \mid Y_{t-1}) \, dx(t-1)$

Applying Bayes’ rule to the integrand yields

$\Pr(x(t), x(t-1) \mid Y_{t-1}) = \Pr(x(t) \mid x(t-1), Y_{t-1}) \times \Pr(x(t-1) \mid Y_{t-1})$

or

$\Pr(x(t), x(t-1) \mid Y_{t-1}) = \Pr(x(t) \mid x(t-1)) \times \Pr(x(t-1) \mid Y_{t-1})$    (2.76)

where the final expression evolves from the first order Markovian assumption.
Applying the Chapman-Kolmogorov equation, the prediction recursion can be
written as

$\Pr(x(t) \mid Y_{t-1}) = \int \Pr(x(t) \mid x(t-1)) \times \Pr(x(t-1) \mid Y_{t-1}) \, dx(t-1)$    (2.77)

Examining the update or correction equation based on the filtering relation and
applying Bayes’ rule, we have

$\Pr(x(t) \mid Y_t) = \dfrac{\Pr(x(t), Y_t)}{\Pr(Y_t)} = \dfrac{\Pr(x(t), y(t), Y_{t-1})}{\Pr(y(t), Y_{t-1})}$

Again applying the rule to this relation gives

$\Pr(x(t) \mid Y_t) = \dfrac{\Pr(y(t) \mid x(t), Y_{t-1}) \times \Pr(x(t) \mid Y_{t-1}) \times \Pr(Y_{t-1})}{\Pr(y(t) \mid Y_{t-1}) \times \Pr(Y_{t-1})}$    (2.78)

Canceling like terms in numerator and denominator and applying the Markov
assumption, we obtain the final expression for the update recursion⁴ as

$\Pr(x(t) \mid Y_t) = \dfrac{\overbrace{\Pr(y(t) \mid x(t))}^{\text{Likelihood}} \times \overbrace{\Pr(x(t) \mid Y_{t-1})}^{\text{Prior}}}{\Pr(y(t) \mid Y_{t-1})}$    (2.79)

where we can consider the update or filtering distribution as a weighting of the
prediction distribution as in the full joint case above, that is,

$\underbrace{\Pr(x(t) \mid Y_t)}_{\text{UPDATE}} = \underbrace{W_c(t, t-1)}_{\text{WEIGHT}} \times \underbrace{\Pr(x(t) \mid Y_{t-1})}_{\text{PREDICTION}}$    (2.80)

where the weight in this case is defined by

$W_c(t, t-1) := \dfrac{\Pr(y(t) \mid x(t))}{\Pr(y(t) \mid Y_{t-1})}$


The resulting processor is shown diagrammatically in Fig. 2.3. 
We summarize the sequential Bayesian processor in Table 2.1. 


⁴ Note that this expression precisely satisfies Bayes’ rule as illustrated in the equation.






FIGURE 2.3 Sequential Bayesian processor for filtering posterior distribution. 


TABLE 2.1 Sequential Bayesian (Filtering) Processor

Prediction

$\Pr(x(t) \mid Y_{t-1}) = \int \Pr(x(t) \mid x(t-1)) \times \Pr(x(t-1) \mid Y_{t-1}) \, dx(t-1)$    (prediction)

where $\Pr(x(t) \mid x(t-1))$    (transition probability)

Correction/Update

$\Pr(x(t) \mid Y_t) = \Pr(y(t) \mid x(t)) \times \Pr(x(t) \mid Y_{t-1}) / \Pr(y(t) \mid Y_{t-1})$    (posterior)

where $\Pr(y(t) \mid x(t))$    (likelihood)

Initial Conditions

$x(0)$, $P(0)$, $\Pr(x(0) \mid y(0))$
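
The prediction and correction steps of Table 2.1 can be carried out exactly when the dynamic variable takes only a finite number of values, since the integral reduces to a sum. The following is a minimal sketch under that assumption. The two-state transition and emission matrices, the observation sequence, and all function names are hypothetical and chosen only to illustrate the recursion.

```python
import numpy as np

def predict(posterior_prev, transition):
    """Prediction: Pr(x(t)|Y_{t-1}) = sum over x(t-1) of
    Pr(x(t)|x(t-1)) * Pr(x(t-1)|Y_{t-1}).
    transition[i, j] = Pr(x(t) = j | x(t-1) = i)."""
    return posterior_prev @ transition

def update(predicted, likelihood):
    """Correction: weight the prediction by the likelihood and normalize
    by the evidence Pr(y(t)|Y_{t-1})."""
    unnormalized = likelihood * predicted
    evidence = unnormalized.sum()
    return unnormalized / evidence, evidence

def bayes_filter(prior0, transition, emission, observations):
    """Run the recursion; emission[i, k] = Pr(y = k | x = i).
    prior0 plays the role of the initial posterior."""
    posterior = np.asarray(prior0, dtype=float)
    history = []
    for y in observations:
        predicted = predict(posterior, transition)
        posterior, _ = update(predicted, emission[:, y])
        history.append(posterior)
    return np.array(history)

# Hypothetical two-state example.
transition = np.array([[0.9, 0.1],
                       [0.2, 0.8]])
emission = np.array([[0.8, 0.2],
                     [0.3, 0.7]])
history = bayes_filter([0.5, 0.5], transition, emission, observations=[0, 0, 1])
```

Each row of `history` is a filtering posterior $\Pr(x(t) \mid Y_t)$ and sums to one. For continuous, nonlinear or non-Gaussian problems the sum becomes an integral that generally cannot be evaluated this way.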


2.6 SUMMARY 

In this chapter we have developed the idea of statistical signal processing from the
Bayesian perspective. We started with the foundations of Bayesian processing by
developing the “batch” solutions to the dynamic variable (state) or parameter estimation
problem. These discussions led directly to the generic Bayesian processor: the
maximum a posteriori (MAP) solution evolving from Bayes’ theorem. We showed the
relationship of MAP to the popular maximum likelihood (ML) estimator, demonstrating that ML
is actually a special case when the dynamic variable is assumed deterministic but
unknown and therefore not random. We also included the minimum variance estimator
(MV, MMSE) and mentioned the least-squares (LS) approach. After discussing some
of the basic statistical operations required (chain rule, performance (CRLB), etc.),
we developed the idea of Bayesian sequential or recursive processing. We started
with the full or joint posterior distribution assuming all of the information about the
dynamic variable and observations was available. The solution to this problem led to a
sequential Bayesian processor. Lastly we followed with the development of a solution
to the more practical marginal posterior or “filtering” distribution and illustrated the
similarity to the previous processor.


MATLAB NOTES 

MATLAB is a command-oriented vector-matrix package with a simple yet effective
command language featuring a wide variety of embedded C language constructs,
making it ideal for statistical operations and graphics. Least-squares problems
are solved with the pseudo-inverse (pinv). When the covariance is known (minimum
variance), the lscov command can be applied. Individual linear algebraic
techniques including the singular-value decomposition, qr-decomposition (Gram-Schmidt)
and eigen-decomposition techniques (svd, qr, eig, etc.) are available.
The Statistics toolbox offers a wide variety of commands to perform estimation.
For instance, “fit” tools are available to perform parameter estimation for a variety
of distributions: exponential (expfit), Gaussian or normal (normfit), Poisson
(poissfit), etc., as well as generic maximum likelihood estimation (mle) and
negative log-likelihood functions for specific distributions such as the
Gaussian/normal (normlike) and exponential (explike).

PROBLEMS

2.1 Derive the following properties of conditional expectations: 

(a) $E_x\{X \mid Y\} = E\{X\}$ if $X$ and $Y$ are independent.

(b) $E\{X\} = E_y\{E\{X \mid Y\}\}$.

(c) $E_x\{g(Y)X\} = E_y\{g(Y)E\{X \mid Y\}\}$.

(d) $E_{xy}\{g(Y)X\} = E_y\{g(Y)E\{X \mid Y\}\}$.

(e) $E_x\{c \mid Y\} = c$.

(f) $E_x\{g(Y) \mid Y\} = g(Y)$.

(g) $E_{xy}\{cX + dY \mid Z\} = cE\{X \mid Z\} + dE\{Y \mid Z\}$.

2.2 Verify the following properties: 

(a) $\nabla_x(a'b) = (\nabla_x a')b + (\nabla_x b')a$, for $a, b \in \mathcal{R}^n$ and functions of $x$.

(b) $\nabla_x(b'x) = b$.

(c) $\nabla_x(x'C) = C$, $C \in \mathcal{R}^{n \times m}$.

(d) $\nabla_x(x') = I$.

(e) $\nabla_x(x'x) = 2x$.

(f) $\nabla_x(x'Ax) = Ax + A'x$, for $A$ not a function of $x$.

2.3 Show that for any unbiased estimate $\hat{x}(y)$ of the parameter vector $x$ the
Cramer-Rao bound is

$\mathrm{Cov}(\tilde{x} \mid x) \ge \mathcal{I}^{-1}$ for $\tilde{x} = x - \hat{x}(y)$ and $x \in \mathcal{R}^n$, $\mathcal{I} \in \mathcal{R}^{n \times n}$

where the information matrix is defined by

$\mathcal{I} := -E_y\{\nabla_x (\nabla_x \ln \Pr(Y \mid X))'\}$

2.4 The sample mean, $\hat{\theta} = \frac{1}{N} \sum_{t=1}^{N} \theta(t)$, is a very important statistic in data
processing.

(a) Show that the sample mean is an unbiased estimator of $\theta \sim \mathrm{Exp}(1/\theta)$.

(b) Estimate the corresponding variance of $\hat{\theta}$.

(c) Show that the sample mean is a consistent estimator of $\theta$.

(d) Construct the two standard deviation confidence interval about the sample
mean estimator. (Hint: Let $\theta \sim \mathcal{N}(\theta, \sigma^2)$, $\sigma^2$ known.)

(e) Show that $\hat{\theta}$ is a minimum variance estimator. (Hint: Show it satisfies the
Cramer-Rao bound with equality.)

(f) If $\hat{\theta}$ is an estimator of $\theta$, calculate the corresponding mean-squared error.

2.5 Let x(t) be an unknown dynamic parameter obtained from the linear measure¬ 
ment system 

$y(t) = C(t)x(t) + v(t)$

where $v \sim \mathcal{N}(0, R_{vv}(t))$ with $y, v \in \mathcal{R}^p$ and $x$ is governed by the state transition
mechanism

$x(t+1) = \Phi(t+1, t)\,x(t)$ for $\Phi \in \mathcal{R}^{n \times n}$

Calculate the Cramer Rao bound for the case of estimating the initial state x(0). 

2.6 Suppose we have the following signal in additive Gaussian noise: 

$y = x + n$ with $n \sim \mathcal{N}(0, R_{nn})$

(a) Find the Cramer-Rao bound if $x$ is assumed unknown.

(b) Find the Cramer-Rao bound if $p(x) = x\,e^{-x/R_{nn}}$, $x \ge 0$.

2.7 Suppose we have two jointly distributed random vectors x and y with known 
means and variances. Find the linear minimum variance estimator. That is, find 

$\hat{x}_{MV} = Ay + b$

and the corresponding error covariance $\mathrm{cov}\,\tilde{x}$, $\tilde{x} = x - \hat{x}$.

2.8 We would like to estimate a vector of unknown parameters $p \in \mathcal{R}^n$ from a
sequence of $N$ noisy measurements given by

$y = Cp + n$

where $C \in \mathcal{R}^{p \times n}$, $y, n \in \mathcal{R}^p$, and $n$ is zero-mean and white with covariance $R_{nn}$.

(a) Find the minimum variance estimate $\hat{p}_{MV}$.

(b) Find the corresponding quality $\mathrm{cov}(p - \hat{p}_{MV})$.

2.9 Suppose we have $N$ samples $\{x(0) \ldots x(N-1)\}$ of a process $x(t)$ to be estimated
by $N$ complex sinusoids of arbitrary frequencies $\{f_0, \ldots, f_{N-1}\}$. Then

$x(k\Delta T) = \sum_{m=0}^{N-1} a_m \exp(j 2\pi f_m k \Delta T)$ for $k = 0, \ldots, N-1$

Find the least-squares estimate $\hat{a}_{LS}$ of $\{a_m\}$.

2.10 Suppose we are given a measurement modeled by 


y(t) = s + n(t) 


where $s$ is random and zero-mean with variance $\sigma_s^2 = 4$ and $n$ is zero-mean and
white with a unit variance. Find the two-weight minimum variance estimate
of $s$, that is,

$\hat{s}_{MV} = \sum_{i=1}^{2} w_i \, y(i)$

that minimizes

$J = E\{(s - \hat{s})^2\}$


2.11 Find the maximum likelihood and maximum a posteriori estimates of the 
parameter x 


$p(x) = a\,e^{-ax}$, $x \ge 0$, $a \ge 0$

and

$p(y \mid x) = x\,e^{-xy}$, $x \ge 0$, $y \ge 0$

2.12 Suppose we have two classes of objects, black and white with shape subclasses,
circle and square and we define the random variables as:

$X_1$ = number of black circular objects
$X_2$ = number of black square objects
$X_3$ = number of white objects

Assume that these objects are trinomially distributed such that:

$\Pr(x \mid \Theta) = \left( \dfrac{(x_1 + x_2 + x_3)!}{x_1! \, x_2! \, x_3!} \right) \cdots$

Suppose a person with blurred vision cannot distinguish shape (circle or
square), but can distinguish between black and white objects. In a given batch
of objects, the number of objects detected by this person is: $y = [y_1 \; y_2]'$ with

$y_1 = x_1 + x_2$
$y_2 = x_3$

(a) What is the probability, $\Pr(y_1 \mid \Theta)$?

(b) What is the expression for the E-step of the EM-algorithm assuming $\Theta_k$ is
the current parameter estimate, that is, find the expression for $E\{x_1 \mid y_1, \Theta_k\}$
with $x_3$ known?

(c) What is the corresponding M-step?

(d) Take the EM solution (a)-(c) above based on 100 samples with $y_1 = 100$
and start iterating with $\Theta_0 = 0$ for 10 steps. What is your final estimate of
the parameter, $\Theta_{10}$? ($\Theta_{true} = 0.5$) for simulation $x_1 = 25$, $x_2 = 38$.

2.13 Suppose we have a bimodal distribution consisting of a Gaussian mixture
with respective means, variances and mixing coefficients: $\{\mu_1, \sigma_1^2, p_1\}$ and
$\{\mu_2, \sigma_2^2, p_2\}$ such that $p_X(x) = \sum_{i=1}^{2} p_i \, \mathcal{N}(\mu_i, \sigma_i^2)$ with $p_2 = 1 - p_1$. We would
like to fit the parameters, $\Theta = (p_1, \mu_1, \sigma_1^2, \mu_2, \sigma_2^2)$ to data. Develop the EM
algorithm for this problem.

2.14 Suppose we are given a measurement system 

$y(t) = x(t) + v(t)$, $t = 1, \ldots, N$

where $v(t) \sim \mathcal{N}(0, 1)$.

(a) Find the maximum likelihood estimate of $x(t)$, that is, $\hat{x}_{ML}(t)$, for
$t = 1, \ldots, N$.

(b) Find the maximum a posteriori estimate of $x(t)$, that is, $\hat{x}_{MAP}(t)$, if
$p(x) = e^{-x}$.

2.15 Suppose we have a simple AM receiver with signal

$y(t) = \Theta s(t) + v(t)$, $t = 1, \ldots, N$

where $\Theta$ is a random amplitude, $s$ is the known carrier, and $v \sim \mathcal{N}(0, R_{vv})$.

(a) Find the maximum likelihood estimate $\hat{\Theta}_{ML}$.

(b) Assume $\Theta \sim \mathcal{N}(\Theta_0, R_{\Theta\Theta})$. Find the maximum a posteriori estimate $\hat{\Theta}_{MAP}$.

(c) Assume $\Theta$ is Rayleigh-distributed (a common assumption). Find $\hat{\Theta}_{MAP}$.

2.16 We would like to estimate a signal from a noisy measurement 

y = s+v 

where $v \sim \mathcal{N}(0, 3)$ and $s$ is Rayleigh-distributed

$p(s) = s\,e^{-s^2/2}$

with

$p(y \mid s) = \dfrac{1}{\sqrt{6\pi}} \, e^{-(y-s)^2/6}$


(a) Find the maximum likelihood estimate. 

(b) Find the maximum a posteriori estimate. 

(c) Calculate the Cramer-Rao bound (ignoring p(s)). 




2.17 Assume that a planet travels an elliptical orbit centered about a known point. 
Suppose we make M observations. 

(a) Find the best estimate of the orbit and the corresponding quality, that is, 
for the elliptical orbit (Gaussian problem) 

$\beta_1 x^2 + \beta_2 y^2 = 1$

Find $\hat{\beta} = [\hat{\beta}_1 \; \hat{\beta}_2]'$.

(b) Suppose we are given the following measurements: $(x, y) = \{(2,2), (0,2),
(-1,1), \text{ and } (-1,2)\}$, find $\hat{\beta}$ and $\hat{J}$, the cost function.

2.18 Find the parameters, $\beta_1$, $\beta_2$ and $\beta_3$ such that

$f(t) = \beta_1 t + \beta_2 t^2 + \beta_3 \sin t$

fits $(t, f(t)) := \{(0,1), (\pi/4, 2), (\pi/2, 3), \text{ and } (\pi, 4)\}$ with corresponding quality
estimate, $\hat{J}$.

SIMULATION-BASED
BAYESIAN METHODS


3.1 INTRODUCTION 

In this chapter we investigate the idea of Bayesian estimation [1-13] using approximate
sampling methods to obtain the desired solutions. We first motivate the
simulation-based Bayesian processors and then review much of the basics required
for comprehension of this powerful methodology. Next we develop the idea of
simulation-based solutions using the Monte Carlo (MC) approach [14-21] and introduce
importance sampling as a mechanism to implement this methodology from
a generic perspective [22-28]. Finally, we consider the class of iterative processors
founded on Markov chain concepts leading to efficient techniques such as the
foundational Metropolis-Hastings approach and the Gibbs sampler [29-37].

Starting from Bayes’ rule and making assertions about the underlying probability
distributions enables us to develop reasonable approaches to design approximate
Bayesian processors. Given “explicit” distributions, it is possible to develop analytic
expressions for the desired posterior distribution. Once the posterior is estimated,
then the Bayesian approach allows us to make inferences based on this distribution
and its associated statistics (e.g., mode, mean, median, etc.). For instance, in
the case of a linear Gauss-Markov (GM) model, calculation of the posterior distribution
leads to the optimal minimum variance solution [11]. But again this was
based completely on the assertion that the dynamic processes were strictly constrained
to Gaussian distributions and a linear GM model, therefore leading to a
Gaussian posterior. When the dynamics become nonlinear, then approximate methods
evolve based on “linearization” techniques, either model-based (Taylor series
transformations) or statistical (sigma-point transformations). In both cases, these
are fundamentally approximations that are constrained to unimodal distributions.

What happens when both the dynamics and statistics are nonstationary and non-Gaussian?
Clearly, these approaches can be applied, but with little hope of success
under most conditions. Therefore, we must resort to other less conventional (in signal
processing) ways of attacking this class of problems that have dominated the science
and engineering literature for a long time [38-61]. This question then leads us directly
to statistical simulation-based techniques invoking random sampling theory and the
Monte Carlo (MC) method, well known in statistics for a long time [14]. This method
is essentially a collection of techniques to estimate statistics based on random sampling
and simulation. It can be thought of simply as “performing estimation through
sampling”. The goal of Bayesian techniques using MC methods is to generate a set
of independent samples from the target posterior distribution with enough samples
to perform accurate inferences [21]. Monte Carlo techniques have been applied to a
large variety of applications in science and engineering [56, 57, 61]. In this chapter,
we start out with the simple idea of random sampling with the underlying motive
of developing Monte Carlo simulation techniques to solve nonlinear/non-Gaussian
signal processing problems.

MC methods involve techniques to estimate the posterior distribution of interest
using either numerical integration-based methods (when possible) or sample-based
simulation methods which attempt to produce independent-identically-distributed
(i.i.d.) samples from a targeted posterior distribution and use them to make statistical
inferences. Following Smith [18], we develop this idea further starting out with Bayes’
rule for the variable or parameter, $X$, and the corresponding data, $Y$, that is, the
posterior distribution is given by

$\Pr(X \mid Y) = \dfrac{\Pr(Y \mid X) \times \Pr(X)}{\Pr(Y)} = \dfrac{\Pr(Y \mid X) \times \Pr(X)}{\int \Pr(Y \mid X) \times \Pr(X) \, dX}$    (3.1)


with the usual definitions of likelihood, prior and evidence (normalization) distributions.
Once the posterior is estimated, then the inferences follow immediately. For
instance, the conditional mean is given by

$E\{X \mid Y\} = \int X \, \Pr(X \mid Y) \, dX$    (3.2)

In the continuous (random variable) case the explicit evaluation of these integrals is
required, yet rarely possible, leading to sophisticated numerical or analytical approximation
techniques to arrive at a solution. These methods rapidly become intractable
or too computationally intensive to be of any real value in practice. Thus an alternative
method is necessary.

Just as we can use the distribution to generate a set of random samples, the samples
can be used to generate (approximately) the underlying distribution (e.g., histogram).
The idea of using samples generated from the prior through the likelihood to obtain
the posterior parallels that of using analytic distributions. For example, ignoring the
evidence we use the method of composition [16] to sample:

$\Pr(X \mid Y) \propto \Pr(Y \mid X) \times \Pr(X)$    (3.3)

so that we can

• Generate samples from the prior: $X_i \rightarrow \Pr(X)$; $i = 1, \ldots, N$;

• Calculate the likelihood: $\Pr(Y \mid X_i)$; and

• Estimate the posterior: $\Pr(X \mid Y) \approx \Pr(Y \mid X_i) \times \Pr(X_i)$.

Unfortunately, this methodology is not realistic unless we can answer the following 
questions: (1) How do we generate samples from distributions for simulations? and 
(2) How do we guarantee that the samples generated are i.i.d. from the targeted 
posterior distribution? 
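
The three steps above can be sketched numerically for a case where the answer is known. The example below is an illustration under assumed conditions, not the procedure of reference [16]: it takes a hypothetical scalar model (prior $X \sim \mathcal{N}(0, 1)$, measurement $Y = X + V$ with $V \sim \mathcal{N}(0, 0.25)$), draws samples from the prior, evaluates the likelihood at each sample, and uses the normalized likelihood values as weights to estimate the conditional mean. Function names and numerical values are assumptions.

```python
import numpy as np

def gaussian_pdf(z, mean, var):
    """Scalar Gaussian density evaluated at z."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def posterior_mean_from_prior_samples(y, num_samples, rng,
                                      prior_var=1.0, noise_var=0.25):
    """Weight prior samples by their likelihood to estimate E{X|Y=y}."""
    x = rng.normal(0.0, np.sqrt(prior_var), size=num_samples)  # X_i ~ Pr(X)
    likelihood = gaussian_pdf(y, x, noise_var)                 # Pr(Y|X_i)
    weights = likelihood / likelihood.sum()                    # normalization replaces the evidence
    return np.sum(weights * x)

rng = np.random.default_rng(0)
estimate = posterior_mean_from_prior_samples(y=1.2, num_samples=100_000, rng=rng)
```

Note that the weighted samples are still draws from the prior, not i.i.d. draws from the posterior, so this sketch does not by itself answer the two questions posed above. It only shows that prior samples combined with likelihood values carry the posterior information.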

However, before we develop the idea of sampling and simulation-based methods, 
we must have some mechanism to evaluate the performance of these samplers. In the 
next section, we briefly describe techniques to estimate the underlying probability 
distributions from data samples. 

3.2 PROBABILITY DENSITY FUNCTION ESTIMATION 

One of the requirements of Bayesian signal processing is to generate samples from
an estimated posterior distribution. If the posterior is of a known closed form (e.g.,
Gaussian), then it is uniquely characterized by its particular parameters (e.g., mean
and variance) that can be “fit” to the samples directly using parameter estimation
techniques. But, if the probability density¹ is too complex or cannot readily be represented
in closed form, then nonparametric techniques must be employed to perform
the estimation.

Since most parametric forms rarely fit the underlying posterior distribution encountered
in the real world, especially if they are multimodal (multiple peaks) distributions,
we investigate the so-called kernel (smoothing) method of PDF estimation. The basic
idea is to fit a simple model individually at each target sample (random variate) location,
say $x$, using the observations close to the target, $p(x)$. This estimate can be
accomplished by using a weighting function or equivalently, kernel function, $\mathcal{K}(x; x_i)$
that assigns a weight to $x_i$ based on its proximity to $x$ [4]. Different kernels lead
to different estimators. For instance, the classical histogram can be considered a
rectangular or box kernel PDF estimator [5], while the fundamental Parzen window
multidimensional estimator is based on a hypercube kernel [3]. Some of the
popular kernel smoothing functions include the triangle, Gaussian, Epanechnikov
and others [4]. In any case, kernel density estimation techniques [6] provide the basic
methodology required to estimate the underlying PDF from random data samples and
subsequently evaluate sampler performance, which is one of the objectives of this chapter.

Theoretically, the underlying principle of distribution estimation is constrained by
the properties of the PDF itself, that is,

• $p_X(x) \ge 0 \;\; \forall x \in \mathcal{R}_X$; and

• $\int_{\mathcal{R}_X} p_X(x) \, dx = 1$.

¹ In this section we introduce more conventional notation for both the cumulative and density or mass
functions, since it is required. The CDF of the random variable $X$ with realization $x$ is defined by $P_X(x)$,
while the PDF or PMF is $p_X(x)$. In instances where it is obvious, we do not use the subscript $X$ to identify
the random variable.

Formally, the probability that the random variate, $x$, will fall within the interval,
$x_i < X < x_j$ for $i < j$ is given by [3]

$\Pr(x_i < X < x_j) = \int_{x_i}^{x_j} p_X(x) \, dx \approx p_X(x) \times \Delta_x$ for $\Delta_x \ll 1$ (small)    (3.4)

which is simply the area under the PDF in the interval $(x_i, x_j)$. Thus the probability
that $X$ is in a small interval of length $\Delta_x$ is proportional to its PDF, $p_X(x)$.

Therefore, it follows that the probability at a point, say $x$, can be approximated by
allowing $x_i = x$ and $x_j = x + \Delta_x$, then

$\Pr(x < X < x + \Delta_x) = \int_{x}^{x + \Delta_x} p_X(x) \, dx$ for $x \in \mathcal{R}_X$    (3.5)

demonstrating that

$p_X(x) = \lim_{\Delta_x \to 0} \dfrac{\Pr(x < X < x + \Delta_x)}{\Delta_x}$    (3.6)

Consider a random draw of $N$ i.i.d. (total) samples according to $p_X(x)$ above. The
probability that $n_N$ of these reside within $\mathcal{R}_X$ is binomial and can be approximated
by the frequency ratio, $n_N/N$, leading to a relative frequency interpretation of the
probability. That is, it follows that

$p_X(x) \times \Delta_x = \dfrac{n_N}{N}$    (3.7)

But this is just the well-known [1] histogram estimator given by

$\hat{p}_X(x) := \hat{p}_N(x) = \dfrac{1}{\Delta_x} \times \dfrac{n_N}{N}, \qquad |x - \bar{x}| \le \dfrac{\Delta_x}{2}$    (3.8)

Here $\bar{x}$ is located at the midpoint of the interval (bin), $\bar{x} - \frac{\Delta_x}{2} \le x \le \bar{x} + \frac{\Delta_x}{2}$, and
the corresponding probability is assumed constant throughout the bin as shown in
Fig. 3.1. If

$\hat{p}_N(x) \approx \dfrac{n_N / N}{\Delta_x}$    (3.9)

then as $N \to \infty$, the histogram estimator converges to the true PDF, that is, $\hat{p}_N(x) \to p_X(x)$,
provided the following three conditions hold:

1. $\lim_{N \to \infty} \Delta_x \to 0$;

2. $\lim_{N \to \infty} n_N \to \infty$; and

3. $\lim_{N \to \infty} \dfrac{n_N}{N} \to 0$.

The first condition must hold to achieve the limit in Eq. 3.6, while the second condition
indicates how $n_N$ is to grow to guarantee convergence. Thus, at all nonzero points of
the PDF by fixing the size of the interval, the probability of samples falling within
this interval is finite and therefore, $n_N \approx \Pr(x) \times N$ and $n_N \to \infty$ as $N \to \infty$. The final
condition arises because as the interval size shrinks, $\Delta_x \to 0$, the corresponding
probability $\Pr(x) \to 0$. These conditions indicate the manner in which the parameters
should be chosen: $N$ large, $\Delta_x$ small with $n_N$ large. Satisfaction of all of these
conditions implies convergence to $p_X(x)$ (see [3, 5] for more details).

FIGURE 3.1 Estimated PDF using both histogram and kernel (Gaussian window, $\mathcal{N}(0, 10)$)
methods for random samples generated from a Gaussian mixture: $\mathcal{N}_1(0, 2.5)$ and $\mathcal{N}_2(5, 1)$
and mixing coefficients, $P_1 = 0.3$ and $P_2 = 0.7$.
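
The histogram estimator, bin count divided by the product of the total sample count and the bin width, can be sketched in a few lines. This is only an illustrative sketch: the standard Gaussian test data, the choice of 50 bins, and the function name `histogram_pdf` are assumptions, not values from the text.

```python
import numpy as np

def histogram_pdf(samples, num_bins):
    """Histogram PDF estimate: n_N / (N * Delta_x) in each equal-width bin."""
    samples = np.asarray(samples, dtype=float)
    edges = np.linspace(samples.min(), samples.max(), num_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)   # n_N for every bin
    width = edges[1] - edges[0]                     # Delta_x
    density = counts / (samples.size * width)
    centers = 0.5 * (edges[:-1] + edges[1:])        # bin midpoints
    return centers, density, width

rng = np.random.default_rng(1)
samples = rng.normal(0.0, 1.0, size=10_000)         # assumed test data
centers, density, width = histogram_pdf(samples, num_bins=50)
```

By construction the estimate is constant within each bin and integrates to one over the sampled range, which matches the two PDF properties required above.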

There are two common ways of ensuring convergence: (1) “shrink” the region
by selecting the interval as a function of the number of samples such as $\Delta_x = \frac{1}{\sqrt{N}}$
(Parzen-window method)²; and (2) “increase” the number of samples within the
interval which can be done as $n_N = \sqrt{N}$. Here the local region grows until it encloses
$n_N$-neighbors of $x$ (nearest-neighbor method). Both methods converge in probability.
This is the principal theoretical result in PDF estimation [3, 5, 6]. Next consider the
kernel density estimator as a “generalized” histogram.

Let $x_1, \ldots, x_N$ be a set of i.i.d. samples of the random variate, $x$, then the kernel
density estimator approximation of the underlying probability density at $x$ is given by

$\hat{p}_X(x) = \dfrac{1}{N \Delta_x} \sum_{i=1}^{N} \mathcal{K}\!\left( \dfrac{x - x_i}{\Delta_x} \right)$    (3.10)

² It has been shown that the “optimal” bin (interval) size can be obtained by searching for the bin that
minimizes the cost function, $C(\Delta_x)$ [7].

where $\mathcal{K}$ is defined as the kernel density (non-negative and integrates to unity) with
$\Delta_x$ the corresponding smoothing parameter or bandwidth used to “tune” the estimate.
The classic bias-variance tradeoff arises here. That is, a smaller $\Delta_x$ (high resolution)
yields smaller bias, but higher variance, whereas a larger $\Delta_x$ (low resolution) gives
a larger bias with smaller variance [5]. We demonstrate the application of kernel
techniques in the following example of estimating a multimodal distribution.

Example 3.1 

Suppose we would like to estimate the PDF of a set of 1000 samples generated from
a mixture of two Gaussian densities such that $\mathcal{N}(\mu_x, \sigma_x^2) \rightarrow \mathcal{N}_1(0, 2.5)$ and $\mathcal{N}_2(5, 1)$
with mixing coefficients, $P_1 = 0.3$ and $P_2 = 0.7$. We use both the histogram and kernel
density estimators. For the histogram we choose $N = 20$ bins to illustrate (graphically)
the bin estimates, while for the kernel density estimator we choose a Gaussian window
specified by $\mathcal{K}(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2}$ with an optimal bandwidth of $\Delta_x = \sigma_x \times \left(\frac{4}{3N}\right)^{1/5}$ [3].

The results are shown in Fig. 3.1 for 100 bins or points where we see the estimated
PDF using both methods along with the “true” Gaussian density. Clearly the Gaussian
kernel estimator provides a much “smoother” estimate of the true density. A smoother
histogram estimate can be obtained by decreasing the bin-width. Both techniques
capture the general shape of the bimodal distribution. △△△
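
A rough numerical sketch of a Gaussian kernel density estimate for a mixture of this kind is given below. It is not the code used to produce Fig. 3.1. The second parameter of each component is assumed to be a variance, the bandwidth follows an assumed rule of the form $\sigma_x (4/3N)^{1/5}$ with $\sigma_x$ taken as the sample standard deviation, and the evaluation grid, seed and function names are illustrative choices.

```python
import numpy as np

def sample_mixture(n, rng):
    """Draw n samples from 0.3*N(0, 2.5) + 0.7*N(5, 1) (variances assumed)."""
    pick_first = rng.random(n) < 0.3
    first = rng.normal(0.0, np.sqrt(2.5), size=n)
    second = rng.normal(5.0, 1.0, size=n)
    return np.where(pick_first, first, second)

def gaussian_kde(samples, points, bandwidth):
    """Kernel density estimate: average of Gaussian kernels centered on the samples."""
    u = (points[:, None] - samples[None, :]) / bandwidth
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return kernel.sum(axis=1) / (samples.size * bandwidth)

rng = np.random.default_rng(0)
samples = sample_mixture(1000, rng)
bandwidth = samples.std() * (4.0 / (3.0 * samples.size)) ** 0.2   # assumed rule
points = np.linspace(-8.0, 10.0, 100)                              # 100 evaluation points
pdf_hat = gaussian_kde(samples, points, bandwidth)
```

Because the bandwidth rule is tuned to a single Gaussian, it tends to over-smooth a bimodal density. Reducing `bandwidth` sharpens the two peaks at the cost of a noisier estimate, which is the bias-variance tradeoff in practice.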

This completes the section on PDF estimation techniques. Note that such details as
selecting the appropriate bin-size for the histogram and the bandwidth for the kernel
density estimator as well as its structure (Gaussian, Box, etc.) are discussed in detail in
[3, 5-7]. These relations enable “tuning” of the estimators for improved performance.

Next, we consider the simulation-based methods in an effort to answer the questions
posed previously: (1) How do we generate samples from distributions for
simulations? and (2) How do we guarantee that the samples generated are i.i.d. from
the targeted posterior distribution?

3.3 SAMPLING THEORY 

The generation of random samples from a known distribution is essential for simulation.
If the distribution is standard and has a closed analytic form (e.g., Gaussian,
exponential, Rayleigh, etc.), then it is usually possible to perform this simulation
easily. This method is called the direct method because it evolves “directly” from the
analytic form. An alternative is the “inversion” method which is based on uniform
variates and is usually preferred. Thus, the uniform distribution becomes an extremely
important distribution in sampling because of its inherent simplicity and the ease with
which samples can be generated [15].

Many numerical sample generators evolve by first generating a set of random
numbers that are uniformly distributed on the interval [a, b] such that all of the
subintervals are of equal length and have “equal probability”. Thus, X is said to be a uniform
random variable

$$X \sim \mathcal{U}(a, b) \tag{3.11}$$

that has a cumulative distribution function (CDF) defined by

$$P_X(x) = \begin{cases} 0 & x < a \\ \dfrac{x - a}{b - a} & a \le x \le b \\ 1 & x > b \end{cases} \tag{3.12}$$

along with the corresponding probability density

$$p_X(x) = \frac{1}{b - a}, \qquad a \le x \le b \tag{3.13}$$

with mean and variance: $\mu_x = \frac{a + b}{2}$ and $\sigma_x^2 = \frac{(b - a)^2}{12}$.

Because of its simplicity and ease of numerical generation, the uniform variate is
the starting point of many simulation schemes. For instance, to generate a binomial
random number (e.g., coin tossing with Pr(heads) = p), we first generate N uniform
random variates and then count the number of times they are less than p, since a
$\mathcal{U}(0,1)$ variate falls below p with probability p. The result
is a set of binomial variates with trial parameter, N, and success (heads) probability, p.
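A minimal sketch of this counting scheme follows. It relies on the fact that a $\mathcal{U}(0,1)$ variate falls below p with probability p, so each such event is scored as a “head”; the function name, seed, and parameter values are hypothetical choices for illustration only.

```python
import random

def binomial_from_uniform(n_trials, p, rng):
    """One binomial variate: count the uniform draws that fall below p."""
    return sum(1 for _ in range(n_trials) if rng.random() < p)

rng = random.Random(1)              # fixed seed (arbitrary) for repeatability
n_trials, p = 20, 0.3
draws = [binomial_from_uniform(n_trials, p, rng) for _ in range(5000)]
sample_mean = sum(draws) / len(draws)   # should be close to n_trials * p = 6
```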

Since statistical sampling techniques are based on the generation of random samples
from a given distribution, we must understand the relationship between an input
PDF, $p_X(x)$, and an output PDF, $p_Y(y)$, when there exists a relationship between the
random variables x and y defined by: y = y(x). The problem is:

GIVEN x and $p_X(x)$, FIND the output PDF, $p_Y(y)$, defining the probability of y.

When we have functions of random variables with known analytic distributions,
then the usual transformation method applies. That is, given the known distribution,
$P_X(x)$ or density $p_X(x)$ and the monotonic, one-to-one, invertible transformation,
y = T(x), then the distribution or density of y is found by using the transformation
relation [1],

$$p_Y(y) = p_X(x) \times \left|\frac{dx}{dy}\right| = p_X\left(x = T^{-1}(y)\right) \times \left|\frac{dx}{dy}\right| \tag{3.14}$$

where $\left|\frac{dx}{dy}\right|$ is the Jacobian of the transformation and $T^{-1}$ is its inverse. The derivation
of this relationship follows directly from the definitions of CDF [2]. Although simple,
this relation establishes one of the basic concepts in random sampling.

Perhaps the most fundamental relation is obtained by applying the transformation
directly to the cumulative distribution function (CDF) defined by

$$y(x) = P_X(x) = \int_{-\infty}^{x} p_X(\alpha)\, d\alpha \tag{3.15}$$

That is, applying the transformation of Eq. 3.14 and taking the required derivatives
gives

$$\frac{dy}{dx} = \frac{dP_X(x)}{dx} = p_X(x) \tag{3.16}$$

which follows from the fundamental theorem of calculus. Therefore we have

$$p_Y(y) = p_X(x) \times \left|\frac{dx}{dy}\right| = \frac{p_X(x)}{|p_X(x)|} = 1 \tag{3.17}$$

yielding a uniformly distributed random variable, $y \sim \mathcal{U}(0, 1)$, that is,

$$p_Y(y) = 1 \quad \text{for } 0 \le y \le 1 \tag{3.18}$$

Thus, the CDF of any continuous random variable, evaluated at that variable, is always
uniformly distributed on the interval [0, 1], independently of $p_X(x)$! This important
result enables us to “sample” from any arbitrary $p_X(x)$, since the random variable x can
be sampled by first generating samples from $y \sim \mathcal{U}(0, 1)$ and then performing the
inversion $x = P_X^{-1}(y)$. This is called the inversion method in sampling theory; it simply
entails generating random samples from $\mathcal{U}(0, 1)$ and determining the samples x by
inversion of the (known) CDF. Note that the CDF can be known analytically or just in
tabular form to perform the inversion. We discuss this theorem more formally in the next
section. First, consider the following example to illustrate this idea.

Example 3.2 

Suppose we have a uniform random variable, $x \sim \mathcal{U}(0, 1)$, and a transformation,
$y = T(x) = -\frac{1}{\lambda} \ln x$. We would like to know the analytic form of the density, $p_Y(y)$.
Since x is uniform, the density is $p_X(x) = \frac{1}{b - a} = 1$; the inverse transformation is
$x = T^{-1}(y) = e^{-\lambda y}$; therefore, taking the derivative and substituting its absolute value
into Eq. 3.14 gives

$$p_Y(y) = (1) \times \left|-\lambda e^{-\lambda y}\right| = \lambda e^{-\lambda y}$$

an exponential distribution with parameter $\lambda$. △△△

Transformations of discrete random variables are much simpler where we use
the probability mass and discrete cumulative distribution functions in place of their
continuous counterparts. In this case the transformation is still continuous with the
identical conditions (invertible, etc.), that is,

$$y_i = T(x_i), \qquad i = 1, \ldots, N \tag{3.19}$$

and therefore, we have the discrete CDF as

$$P_Y(y_i) = \Pr(Y \le y_i) = \Pr(X \le x_i) = \Pr\left(X \le x_i = T^{-1}(y_i)\right); \qquad i = 1, \ldots, N \tag{3.20}$$


3.3.1 Uniform Sampling Method 

As noted, the uniform distribution plays a vital role in simulating random variates
and is applied heavily in sample-based simulation schemes. We formally present two
fundamental theorems and a corollary [1] to justify the theoretical foundation of this
approach and motivate their use through some simple examples.

Uniform Transformation Theorem Given a random variable, X, with distribution,
$P_X(x)$, there exists a unique, monotonic transformation, T(X), such that the random
variable, U = T(X), is distributed as $\mathcal{U}(0, 1)$, that is, if T(X) is selected as

$$T(X) = P_X(X)$$

then

$$U \sim \mathcal{U}(0, 1)$$

The proof of this theorem follows directly [1] from the properties of the CDF.
Suppose x is arbitrary and $u = P_X(x)$. Then with a monotonic $P_X(x)$, we have $U \le u$
iff $X \le x$ and therefore it follows that

$$P_U(u) = \Pr(U \le u) = \Pr(X \le x) = P_X(x) = u$$

giving the desired result. This is a strong result stating that any CDF can be represented
in terms of a set of uniform random variates as discussed before.

The second theorem [1] provides the basis of most sample-based simulations and 
is reliant on the existence of the “inverse” of the CDF. 

Inverse Transformation Theorem Given a uniform random variable, $U \sim \mathcal{U}(0, 1)$,
and a distribution (target), $P_X(x)$, there exists a unique, monotonic transformation,
T(U), such that the random variable, X = T(U), is distributed as $P_X(x)$, that is,
if T(U) is selected as

$$T(U) = X = P_X^{-1}(U)$$

then

$$X \sim P_X(x)$$

Again the proof of this theorem follows directly from the properties of the CDF
[1]. From the Uniform Transformation Theorem above, we know that U is uniform
and $P_X(X)$ is arbitrary; therefore, $U = P_X(X) \rightarrow X = P_X^{-1}(U)$ and it follows that

$$\Pr(X \le x) = \Pr\left(P_X^{-1}(U) \le x\right) = \Pr\left(U \le P_X(x)\right) = P_X(x)$$

providing the proof. We illustrate this technique with the following example.

Example 3.3 

Suppose we would like to generate a random variable, X, exponentially distributed
with parameter $\lambda$, $X \sim \text{Exp}(\lambda)$. From the Uniform Transformation Theorem, we have

$$u = P_X(x) = 1 - e^{-\lambda x}$$

and

$$x = P_X^{-1}(u) = -\frac{1}{\lambda} \ln(1 - u)$$

To generate exponential random variables:

• Generate uniform samples: $u_i \rightarrow \mathcal{U}(0, 1)$

• Transform to exponential: $x_i = -\frac{1}{\lambda} \ln(1 - u_i)$

Also since $(1 - u_i) \sim \mathcal{U}(0, 1)$, the exponential variates are generated more
efficiently from $x_i = -\frac{1}{\lambda} \ln(u_i)$. △△△
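The two steps of this example translate almost directly into code. The sketch below is illustrative only; the function name, seed, and the value $\lambda = 2$ are assumptions. It uses the $\ln(1 - u)$ form because Python's `random.random()` returns values in [0, 1), which keeps the argument of the logarithm strictly positive.

```python
import math
import random

def exponential_by_inversion(lam, n, rng):
    """Inverse-CDF sampling: x = -(1/lam) * ln(1 - u) with u ~ U(0, 1)."""
    return [-math.log(1.0 - rng.random()) / lam for _ in range(n)]

rng = random.Random(2)              # fixed seed (arbitrary) for repeatability
lam = 2.0
xs = exponential_by_inversion(lam, 20000, rng)
sample_mean = sum(xs) / len(xs)     # should be close to 1/lam = 0.5
```

A quick sanity check is to compare the fraction of samples below some point x with the exponential CDF, $1 - e^{-\lambda x}$.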

To further illustrate the inverse CDF method, we apply it to a discrete problem that 
occurs quite frequently in nonlinear processing. 

Example 3.4 

Suppose we have estimated a discrete empirical distribution representing a posterior
density given by

$$\hat{p}_X(x) = \sum_{i=1}^{N} W_i\, \delta(x - x_i)$$

with corresponding CDF

$$\hat{P}_X(x) = \sum_{i=1}^{N} W_i\, \mu(x - x_i)$$

where $\mu(x)$ is a unit-step function, $W_i := \Pr(X = x_i)$ and of course, $\sum_{i=1}^{N} W_i = 1$.
The CDF is shown in Fig. 3.2a. Using the inverse CDF method, we can generate
realizations of $X = x_i$ by:

• Generate uniform samples: $u_k \rightarrow \mathcal{U}(0, 1)$

• Simulate samples: $x_i = P_X^{-1}(u_k)$ or

$$x_i = x_k \quad \text{for } P_X(x_{k-1}) \le u_k < P_X(x_k)$$

or using the empirical distribution

$$x_i = x_k \quad \text{for } \sum_{j=1}^{k-1} W_j \le u_k < \sum_{j=1}^{k} W_j$$

This transformation is illustrated in Fig. 3.2b. △△△
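The “bracketing” step of this example amounts to a search in the cumulative sum of the weights. The following sketch assumes the locations are sorted and the weights sum to one; the function name, the four-point distribution, and the seed are hypothetical values chosen only for illustration.

```python
import bisect
import itertools
import random

def sample_discrete(locations, weights, n, rng):
    """Inverse-CDF sampling from a weighted set of points {x_k, W_k}."""
    cdf = list(itertools.accumulate(weights))      # running sums of the weights
    samples = []
    for _ in range(n):
        u = rng.random()
        # bisect_right returns k with cdf[k-1] <= u < cdf[k], the bracketing interval
        k = bisect.bisect_right(cdf, u)
        samples.append(locations[min(k, len(locations) - 1)])  # guard against round-off
    return samples

rng = random.Random(3)              # fixed seed (arbitrary) for repeatability
locations = [-1.0, 0.0, 2.0, 5.0]
weights = [0.1, 0.2, 0.3, 0.4]
samples = sample_discrete(locations, weights, 20000, rng)
freqs = [samples.count(x) / len(samples) for x in locations]  # should approach weights
```

The same search is the core of the resampling schemes mentioned later, where the weights come from a particle approximation rather than being fixed in advance.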

Summarizing, we first generate a uniform random variate, $u_k$, and then “bracket” its
probability to determine the desired random variable, $x_i$. So we see from these examples
that using the inverse CDF method enables us to generate continuous and discrete
random variates from both the analytic function as well as discrete samples from an
empirical distribution. We will apply this technique frequently in the simulation-based
processor. This approach is also used heavily in the resampling algorithms to follow
(e.g., uniform, systematic, etc.). Numerical methods can also be used to perform the
inversion by solving $P_X(X) - U = 0$ for X [5, 15].

FIGURE 3.2 Empirical Distribution: (a) Estimated CDF. (b) Estimated ICDF.

Before we close this section, let us note that the above theorems can be generalized 
to simulate any distribution from any other distribution using the uniform variates as 
an intermediate step leading to the following corollary. 

Corollary Given a distribution, $P_X(X)$, and a corresponding target distribution,
$P_Y(Y)$, there exists a unique, monotonic transformation, T(X), such that the
random variable, Y = T(X), is distributed as $P_Y(y)$, that is, if T(X) is selected as

$$T(X) = Y = P_Y^{-1}\left(P_X(X)\right)$$

then

$$\Pr(Y \le y) = P_Y(y)$$

The proof of this corollary follows by applying the results of the previous theorems
with $U = P_X(x)$ and $Y = P_Y^{-1}(U)$ for $U \sim \mathcal{U}(0, 1)$. Consider the following example to
demonstrate this approach to simulation.

Example 3.5 

Consider a random variable, X, distributed as $X \sim P_X(x) = 1 - e^{-x} - x e^{-x}$. We would
like to generate a random variable, Y, exponentially distributed with parameter $\lambda$,
$Y \sim \text{Exp}(\lambda)$. Then from the Corollary, we have

$$u = P_Y(y) = 1 - e^{-\lambda y} \quad \text{and} \quad y = P_Y^{-1}(u) = -\frac{1}{\lambda} \ln(1 - u)$$

Thus it follows that

$$P_Y^{-1}\left(P_X(x)\right) = -\frac{1}{\lambda} \ln\left(1 - P_X(x)\right) = -\frac{1}{\lambda} \ln\left(e^{-x} + x e^{-x}\right)$$

Therefore we have

$$y_i = -\frac{1}{\lambda} \ln\left(e^{-x_i} + x_i e^{-x_i}\right)$$

Sampling from the uniform (as before), we first generate $\{x_i\}$ and then the desired set
of random variates, $\{y_i\}$. △△△
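Because the CDF $P_X(x) = 1 - e^{-x} - x e^{-x}$ has no closed-form inverse, one way to carry out this example numerically is to generate $\{x_i\}$ by solving $P_X(x) - u = 0$ with bisection and then apply the transformation of the Corollary. The sketch below is one such approach under stated assumptions; the function names, the search bracket [0, 50], the tolerance, the seed, and $\lambda = 2$ are all hypothetical choices.

```python
import math
import random

def cdf_x(x):
    """Source CDF of the example: P_X(x) = 1 - exp(-x) - x*exp(-x), x >= 0."""
    return 1.0 - math.exp(-x) - x * math.exp(-x)

def invert_cdf(cdf, u, lo=0.0, hi=50.0, tol=1e-10):
    """Solve cdf(x) - u = 0 by bisection; assumes cdf is increasing on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < u:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def to_exponential(x, lam):
    """Corollary transformation: y = -(1/lam) * ln(exp(-x) + x*exp(-x))."""
    return -math.log(math.exp(-x) + x * math.exp(-x)) / lam

rng = random.Random(4)              # fixed seed (arbitrary) for repeatability
lam = 2.0
xs = [invert_cdf(cdf_x, rng.random()) for _ in range(5000)]   # x_i ~ P_X
ys = [to_exponential(x, lam) for x in xs]                     # y_i ~ Exp(lam)
mean_x = sum(xs) / len(xs)          # this P_X is a Gamma(2, 1) CDF, so the mean is near 2
mean_y = sum(ys) / len(ys)          # should be close to 1/lam = 0.5
```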

We complete this subsection by stating the Golden Rule of Sampling [22] given by:

• Generate uniform samples: $u_i \rightarrow \mathcal{U}(0, 1)$; $i = 1, \ldots, N$;

• Define the transformation: $u_i = P_X(x_i)$; and

• Apply the inverse transformation: $x_i = P_X^{-1}(u_i)$

Next we investigate a more general approach. 

3.3.2 Rejection Sampling Method 

In order to use the uniform sampling simulation methods of the previous section, we
require the inverse CDF, and obtaining it is not a simple task, even when the distribution
is known in closed form. The rejection sampling method offers an alternative that not
only eliminates the inversion problem, but also becomes an integral part of many of the
sophisticated sampling algorithms to follow because of its simplicity and generality.
In principle, the rejection method can be applied to any distribution with a density
given up to a normalization constant [32].

The basic sampling problem in this context is that we are trying to generate samples
from a density (or distribution) that is capable of being evaluated and we have a
function

$$p_X(x) = c \times \Pr(X)$$

where Pr(X) is the target distribution and $p_X(x)$ is related and computable up to
the normalization constant c, which is not known. Suppose we select a sampling
distribution, say q(x), such that there exists a “covering” constant M with

$$p_X(x) \le M q(x) \quad \forall x$$

The rejection sampling method is illustrated in Fig. 3.3 (c = 1) and summarized as:

• Generate a sample: $x_k \rightarrow q(x)$

• Generate a uniform sample: $u_k \rightarrow \mathcal{U}(0, 1)$

• ACCEPT the sample: $x_i = x_k$ if $u_k \le \dfrac{p_X(x_k)}{M q(x_k)}$

• Otherwise, REJECT the sample and generate the next trial sample: $x_k$

FIGURE 3.3 Rejection Sampling Method.


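The proof of this method is given in Liu [32] using an indicator function, I(x),
defined by

$$I(x) = \begin{cases} 1 & \text{if } x \text{ Accepted} \\ 0 & \text{if } x \text{ Rejected} \end{cases} \tag{3.21}$$

Proceeding,

$$\Pr(I(x) = 1) = \int \Pr(I(x) = 1 \mid X = x)\, q(x)\, dx = \int \frac{c \Pr(x)}{M q(x)}\, q(x)\, dx = \frac{c}{M}$$

and

$$\Pr(x \mid I(x) = 1) = \frac{\dfrac{c \Pr(x)}{M q(x)} \times q(x)}{c / M} = \Pr(x)$$

which shows that the acceptance of the sample corresponds to sampling directly from
the target distribution, Pr(x).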

The expected number of trial samples required to accept one x ~ Pr(x) is M (for c = 1)
and therefore, the key to using this approach is to select a good proposal distribution
q(x) with a low M, a nontrivial problem. Note that Example 1.1, in which the area of a
circle was estimated using sampling methods, is a simple geometric illustration of this
methodology. The following example from Papoulis [1] demonstrates the method in an
analytic form.


Example 3.6 

Suppose we are given a random variable, $x \sim \text{Exp}(1)$, and we would like to simulate
the random variable with truncated Gaussian, $y \sim \mathcal{N}(0, 1)$, that is, we have

$$p(x) = \frac{2}{\sqrt{2\pi}}\, e^{-x^2/2} \times \mu(x) \quad \text{and} \quad q(x) = e^{-x}$$

where $\mu(x)$ is a unit step function as before; then for x > 0, we have

$$\frac{p(x)}{q(x)} = \sqrt{\frac{2e}{\pi}}\; e^{-(x - 1)^2/2} \le \sqrt{\frac{2e}{\pi}}$$

Setting $M = \sqrt{2e/\pi}$, we have the acceptance/rejection sampling rule:

$$x_i = x_k \quad \text{if } u_k \le e^{-(x_k - 1)^2/2} \quad \text{(Accept)}$$

△△△
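The acceptance rule of this example can be checked numerically. The sketch below draws the proposal from Exp(1) by inversion and accepts with probability $p(x)/(M q(x)) = e^{-(x - 1)^2/2}$, where $M = \sqrt{2e/\pi}$; the function name, seed, and sample size are hypothetical, and the code is meant as an illustration rather than a tuned implementation.

```python
import math
import random

def sample_half_gaussian(n, rng):
    """Rejection sampling of p(x) = 2/sqrt(2*pi) * exp(-x^2/2), x >= 0, with q = Exp(1)."""
    accepted, trials = [], 0
    while len(accepted) < n:
        trials += 1
        x_k = -math.log(1.0 - rng.random())          # proposal draw from Exp(1)
        u_k = rng.random()                           # uniform draw for the accept test
        if u_k <= math.exp(-0.5 * (x_k - 1.0) ** 2): # p(x_k) / (M * q(x_k))
            accepted.append(x_k)
    return accepted, trials

rng = random.Random(5)              # fixed seed (arbitrary) for repeatability
samples, trials = sample_half_gaussian(20000, rng)
sample_mean = sum(samples) / len(samples)   # half-Gaussian mean is sqrt(2/pi)
trials_per_sample = trials / len(samples)   # should be close to M = sqrt(2e/pi)
```

The average number of trials per accepted sample approaching M, roughly 1.32 here, illustrates why a proposal with a small covering constant is desirable.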

This completes the background on sampling theory; next we consider the Monte
Carlo approach.


3.4 MONTE CARLO APPROACH 


The Monte Carlo approach to solving Bayesian estimation problems is to replace complex
analytic or unknown probability distributions with sample-based representations
to solve a variety of “unsolvable” problems in inference (integration, normalization,
marginalization, expectation, etc.), optimization (parameter, MAP, ML estimation),
statistical mechanics (Boltzmann equation), and nuclear physics (fission, diffusion, etc.)
[32, 35]. The key to the MC approach is to generate independent random samples from
a probability distribution, Pr(X), usually known only up to a normalizing constant
(evidence) [32]. Typically, generating independent samples from this distribution is
not feasible, implying sample dependencies or the use of a proposal distribution, q(X),
that is similar to but not exactly the target distribution, Pr(X). Independent (i.i.d.)
samples are important because both the strong and weak Laws of Large Numbers (the
sample mean converges to the population mean, almost surely and in probability,
respectively) ensure that the inferences (e.g., mean, variance) can be made as accurate
as desired by increasing the number of samples. However, the samples can be dependent
and still properly reflect the probability of the target distribution, opening the
possibility of Markov chain methods (see Sec. 3.4.1).

The rejection method just discussed [22], importance sampling [16], and
sampling-importance-resampling [26] are methodologies that do employ a proposal
distribution. The Metropolis technique and its variants provide the foundation for the
dependent-sample approach, using Markov chain concepts to generate dependent samples
from a chain with Pr(X) as its invariant or stationary distribution. In this section we
develop the idea of the MC simulation-based approach to Bayesian estimation using
iterative rather than sequential Monte Carlo techniques.

Theoretically, the Monte Carlo approach to sampling is based on the following
principles [10, 27, 29, 30, 35]. Here N i.i.d. samples, $\{X(i)\}_{i=1}^{N}$, are drawn from the
target density, p(X), to produce an estimate of the empirical distribution (density)

$$\hat{\Pr}(X) \approx \hat{p}_N(X) = \frac{1}{N} \sum_{i=1}^{N} \delta(X - X(i)) \tag{3.22}$$

which can be used to approximate integrals (pdfs, areas, etc.) with sums, that is,

$$I_N(f) = \int f(X)\, \hat{p}_N(X)\, dX = \frac{1}{N} \sum_{i=1}^{N} f(X(i)) \xrightarrow{N \to \infty} I(f) = \int f(X)\, p(X)\, dX \tag{3.23}$$

Here $I_N(f)$ is unbiased and converges (almost surely) to I(f) according to the strong
Law of Large Numbers. When the variance of f(X) is bounded ($\sigma_f^2 < \infty$), the variance
of the estimate is

$$\text{var}\left(I_N(f)\right) = \frac{\sigma_f^2}{N} \tag{3.24}$$

A central limit theorem argument shows that the error converges in distribution as

$$\sqrt{N}\left(I_N(f) - I(f)\right) \xrightarrow{N \to \infty} \mathcal{N}(0, \sigma_f^2) \tag{3.25}$$

The main advantage of MC over deterministic integration is that it positions its samples
over regions of high probability.

In signal processing, we are usually interested in some statistical measure of a 
random signal or parameter expressed in terms of its moments. Let us take a slightly 
more detailed look at just how MC concepts can be applied with this perspective 
in mind. Suppose we have some signal function, say f(X), with respect to some 
underlying probabilistic distribution, Pr(X). Then a typical measure to seek is its 
performance “on the average” which is characterized by the expectation 

$$E_X\{f(X)\} = \int f(X) \Pr(X)\, dX \tag{3.26}$$

From the Bayesian perspective, the embedded distribution can be thought of as 
the “posterior” of the signals/parameters. The Bayesian approach must integrate over 
high-dimensional probability distributions to make inferences about parameters or 
predictions about signals. Unless the integral is analytically tractable, the usual 
method of evaluation is through numerical integration (deterministic) techniques. 
Unfortunately the number of points to evaluate both f(·) and Pr(·) increases exponentially
with the dimensionality of the signal/parameter space. Also it is not possible to
evaluate this integral over the entire space in practice; therefore, we concentrate on 
specific regions where the integrand is dense (not null). Instead of attempting to use 
numerical integration techniques, stochastic sampling techniques known as Monte 
Carlo (MC) integration have evolved as an alternative (see Fig. 1.2 for concept). As 
mentioned, the key idea embedded in the MC approach is to represent the required 
distribution as a set of random samples rather than a specific analytic function (e.g., 
Gaussian). As the number of samples becomes large, they provide an equivalent
representation of the distribution enabling moments to be estimated directly. A functional
estimate of the distribution could also be fit to the samples, if desired. 

Integration has been used throughout statistics to evaluate probabilities and 
expectations. However, with Monte Carlo techniques the process is reversed and 
expectations are used to calculate integrals [1, 16, 28, 32]. Suppose we are asked to
evaluate a multidimensional integral,

$$I = \int g(x)\, dx$$

Then an MC approach would be to factorize the integrand as

$$g(x) \rightarrow f(x)\, p(x); \qquad p(x) \ge 0 \quad \text{and} \quad \int p(x)\, dx = 1$$

where p(x) is interpreted as a probability distribution from which samples can be drawn.
This is the foundation of sampling techniques based on MC integration. Monte Carlo
approaches draw samples from the required distribution and then form sample averages
to approximate the sought after distributions, that is, they map integrals to
discrete sums. Thus, MC integration evaluates Eq. 3.26 by drawing samples, {X(i)},
from Pr(X). Assuming perfect sampling, this produces the estimated or empirical
distribution given by

$$\hat{\Pr}(X) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(X - X(i)) \tag{3.27}$$

which is a probability distribution of mass or weights, $\frac{1}{N}$, and random variable or
location X(i). Substituting the empirical distribution into the integral gives

$$E_X\{f(X)\} = \int f(X)\, \hat{\Pr}(X)\, dX \approx \frac{1}{N} \sum_{i=1}^{N} f(X(i)) \equiv \bar{f} \tag{3.28}$$

which follows directly from the sifting property of the delta function. Here $\bar{f}$ is said
to be a Monte Carlo estimate of $E_X\{f(X)\}$. Clearly, it is unbiased, since

$$E\{\bar{f}\} = \frac{1}{N} \sum_{i=1}^{N} E\{f(X(i))\} = E_X\{f(X)\}$$

with variance given by

$$\text{var}(\bar{f}) = \frac{1}{N} \int \left[f(X) - E_X\{f(X)\}\right]^2 \Pr(X)\, dX$$

Additionally, if the variance is finite, then the central limit theorem holds and the
error in estimation converges to a zero-mean Gaussian, $\mathcal{N}(0, \text{var}(\bar{f}))$. Consider
the following examples to illustrate these concepts. First we discuss an analytic case,
then a numerical case to solidify both MC approaches.

Example 3.7 

Suppose we would like to solve for the integral

$$I = \int_0^1 f(x)\, dx$$

using the MC approach. Let u be defined as a random variate drawn from a uniform
distribution, $u \sim \mathcal{U}(0, 1)$; then we can express the integral in terms of an expectation as

$$I = E\{f(u)\} = \int_0^1 f(u) \Pr(u)\, du$$

Generating a set of independent, identically distributed (i.i.d.) uniform samples,
$u_1, \ldots, u_N$, the corresponding set of functions, $\{f(u_i)\}$, are also i.i.d. with mean
I as defined above. Therefore, the uniform sampling distribution is given by (equally
likely)

$$\Pr(u = u_i) = \frac{1}{N} \sum_{i=1}^{N} \delta(u - u_i) \quad \text{for } 0 \le u_i \le 1$$

Substituting this expression into the expectation relation, we obtain

$$I \approx E\{f(u)\} = \int_0^1 f(u) \left[\frac{1}{N} \sum_{i=1}^{N} \delta(u - u_i)\right] du = \frac{1}{N} \sum_{i=1}^{N} f(u_i)$$

From the Law of Large Numbers, as $N \to \infty$, it follows that

$$\frac{1}{N} \sum_{i=1}^{N} f(u_i) \rightarrow E\{f(u)\} = I$$

Therefore, we can approximate the integral by generating a large number of random
variables drawn from a uniform sampling distribution, transform them according
to some functional relationship and calculate the sample mean to approximate the
desired integral. △△△
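As a concrete instance of this example, the sample-mean estimator can be applied to an integrand with a known answer. The choice $f(x) = e^x$, for which $I = e - 1$, is an assumption made here purely for illustration, as are the function name, seed, and sample size.

```python
import math
import random

def mc_integrate(f, n, rng):
    """Sample-mean estimate of the integral of f over [0, 1] using u_i ~ U(0, 1)."""
    return sum(f(rng.random()) for _ in range(n)) / n

rng = random.Random(6)              # fixed seed (arbitrary) for repeatability
estimate = mc_integrate(math.exp, 100000, rng)
exact = math.e - 1.0                # closed-form value of the integral of exp on [0, 1]
error = abs(estimate - exact)
```

Repeating the run with different sample sizes shows the error shrinking roughly as $1/\sqrt{N}$, consistent with the variance of a sample mean.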

This simple example illustrates the basic MC concept that will be applied throughout
this text. Consider another example that illustrates this approach further by
developing a simulation-based solution to a familiar statistical estimation problem.

Example 3.8 

Suppose we have a Gaussian random variable and we would like to estimate its mean
and variance. Knowledge that it is Gaussian enables us to write the closed form
expression for the distribution

$$\Pr(x) = \frac{1}{\sqrt{2\pi\sigma_x^2}} \exp\left\{-\frac{(x - m_x)^2}{2\sigma_x^2}\right\}$$

and the mean and the variance can be calculated analytically as

$$m_x = \int x \Pr(x)\, dx \quad \text{and} \quad \sigma_x^2 = \int (x - m_x)^2 \Pr(x)\, dx$$

In contrast, the Monte Carlo approach is to generate N samples from a Gaussian.
Assuming perfect sampling, $x_i \rightarrow \mathcal{N}(m_x, \sigma_x^2)$, and we have that

$$\hat{\Pr}(x) \approx \frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)$$

Now the mean and variance can be estimated from the samples directly using

$$\hat{m}_x = \int x\, \hat{\Pr}(x)\, dx = \int x \left[\frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)\right] dx = \frac{1}{N} \sum_{i=1}^{N} x_i$$

with variance

$$\hat{\sigma}_x^2 = \int (x - \hat{m}_x)^2\, \hat{\Pr}(x)\, dx = \int (x - \hat{m}_x)^2 \left[\frac{1}{N} \sum_{i=1}^{N} \delta(x - x_i)\right] dx = \frac{1}{N} \sum_{i=1}^{N} (x_i - \hat{m}_x)^2$$

Summarizing the MC approach for this example, we must:

• Generate N samples from a Gaussian: $x_i \rightarrow \mathcal{N}(m_x, \sigma_x^2)$;

• Estimate the desired statistics of the distribution from its samples as: $\hat{m}_x$ and $\hat{\sigma}_x^2$

We performed a simulation in MATLAB to generate Gaussian variates with $m_x = 2$
and $\sigma_x^2 = 4$. The results are shown in Fig. 3.4. In Fig. 3.4a we see a simulation for
N = 1000 samples (+) with corresponding estimated mean (solid line) and upper and
lower 95%-confidence limits about the mean ($\hat{m}_x \pm 1.96\hat{\sigma}_x$). The sample mean and
variance for this MC realization were $\hat{m}_x = 1.97$ and $\hat{\sigma}_x^2 = 4.01$. The distribution
was estimated using a histogram with 100 bins and is shown in Fig. 3.4b with a near
perfect MC solution using $N = 10^6$ samples shown in the inset. △△△
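The two steps summarized in this example can be reproduced with a short script. The following Python sketch is not the MATLAB simulation behind Fig. 3.4; it only mirrors the same settings ($m_x = 2$, $\sigma_x^2 = 4$, N = 1000) with an arbitrary seed and hypothetical variable names, so its numerical estimates will differ from those quoted in the text.

```python
import math
import random

rng = random.Random(7)              # fixed seed (arbitrary) for repeatability
m_x, var_x, n = 2.0, 4.0, 1000

# Step 1: generate N samples from the Gaussian N(m_x, var_x)
xs = [rng.gauss(m_x, math.sqrt(var_x)) for _ in range(n)]

# Step 2: estimate the statistics directly from the samples
m_hat = sum(xs) / n
var_hat = sum((x - m_hat) ** 2 for x in xs) / n

# 95% limits about the estimated mean, as drawn in Fig. 3.4a
lower = m_hat - 1.96 * math.sqrt(var_hat)
upper = m_hat + 1.96 * math.sqrt(var_hat)
frac_inside = sum(1 for x in xs if lower <= x <= upper) / n
```

Roughly 95% of the samples should fall between `lower` and `upper`, which is a simple consistency check on both estimates.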

So we see from this example how MC methods can be used to approximate
(estimate) distributions and their associated statistics from simulated samples.

These methods are acceptable as long as high accuracy is not required. Monte
Carlo techniques tend to perform a “divide and conquer” approach to integration by
breaking the integral up into distinct regions around the integrand consisting of strong
local peaks at known locations. They are typically very time consuming methods, on
the order of tens and hundreds of thousands of points, and have only recently gained
notable prominence in the signal processing literature due to the major advances in
fast computers [28]. Next let us consider the extension of Monte Carlo techniques
using Markov chains.

FIGURE 3.4 Monte Carlo approach for estimation of the statistics of a Gaussian random
process: (a) Simulation ($m_x = 2$, $\sigma_x^2 = 4$, N = 1000) with 95% confidence limits (dashed line)
about $\hat{m}_x$ (solid line). (b) Estimated distribution (histogram) from samples. Inset is estimate
for $N = 1 \times 10^6$ samples. [Panel (a) horizontal axis: Sample No.]

3.4.1 Markov Chains 

In MC integration, the population mean of f(X) is estimated by the sample mean.
When the samples $\{X(i)\}$ are independent, the Law of Large Numbers ensures
that the approximation can be made as accurately as required by increasing the number
of samples, N. Generally, however, drawing samples from Pr(X) is not feasible,
especially when it is a non-standard distribution. However, dependent samples can
be generated by any process that draws samples throughout the support (range) of the
distribution. One efficient technique to accomplish this sampling is through a Markov
chain having Pr(X) as its unique stationary or invariant distribution; this methodology
is termed Markov chain Monte Carlo (MCMC). This technique is basically MC
integration where the random samples are produced using a Markov chain. Recall
that a Markov chain is a discrete random process possessing the property that the
conditional distribution at the present sample (given all of the past samples) depends
only on the previous sample (first-order), that is,

$$\Pr\left(X_i(t) \mid X_j(t-1), \ldots, X_k(0)\right) = \Pr\left(X_i(t) \mid X_j(t-1)\right)$$

Summarizing, the Markov chain dynamics are represented by the transition probability,
$\mathcal{P}_{ij}(t-1, t) := \Pr(X(t) = X_i \mid X(t-1) = X_j)$, denoting the probability that the
state at time t will be $X_i$ given that it is in state $X_j$ at time t − 1. Further,
if the chain is also homogeneous in time, then $\mathcal{P}_{ij}(t-1, t)$ depends only on the time
difference (in general) and therefore the transition probability is stationary such that
$\mathcal{P}_{ij}(t-1, t) \rightarrow \mathcal{P}_{ij}$ with $\mathcal{P}_{ij} \ge 0$ and $\sum_i \mathcal{P}_{ij} = 1$ [1].

Thus, the basic requirement in Monte Carlo techniques is to generate random
samples from a probability distribution or target distribution only known up to a
normalizing constant. Typically it is not possible to generate the samples from the target
directly; instead, just as in rejection sampling, samples are generated from a known
trial distribution that is similar to the target distribution. In order to understand the
MCMC approach we must first define critical properties of the Markov chain.

Markov chains possess certain crucial properties that must exist for them to be
useful in MC simulations [32, 33]. A Markov chain begins with an initial distribution,
$\Pr(X_i(0))$, and evolves to another indexed variable $X_i(t)$ determined by a transition
kernel, that is, at index t we have

$$\Pr(X_i(t)) = \sum_j \Pr\left(X_i(t) \mid X_j(t-1)\right) \times \Pr\left(X_j(t-1)\right) \tag{3.29}$$


An invariant distribution is a fixed point solution to Eq. 3.29. For distribution
estimation the constraint is even more stringent, since we require time reversible
chains which must satisfy a detailed balance as (ignoring the index)

$$\Pr(X_i(t)) \times \Pr\left(X_j(t) \mid X_i(t)\right) = \Pr(X_j(t)) \times \Pr\left(X_i(t) \mid X_j(t)\right) \quad \forall\, X_i(t), X_j(t) \tag{3.30}$$

which means that the probability of a transition from $X_i(t)$ to $X_j(t)$ is identical to the
probability of a transition from $X_j(t)$ to $X_i(t)$, implying invariance [28, 32]. The chain
is also required to be ergodic, which means that regardless of the initial distribution,
the probability at t converges to the invariant distribution as $t \to \infty$, that is,

$$\lim_{t \to \infty} \Pr(X_i(t)) \rightarrow \Pr(X_i) \tag{3.31}$$
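These properties can be verified on a small discrete chain. The sketch below builds a three-state transition matrix, with `P[i][j]` standing for $\Pr(X(t) = X_i \mid X(t-1) = X_j)$, from an assumed target distribution using a simple propose-and-accept construction, then checks detailed balance and iterates the state distribution from an arbitrary starting point. The target values and the construction are hypothetical choices made only to illustrate invariance and ergodicity.

```python
target = [0.2, 0.3, 0.5]            # assumed invariant (target) distribution
n_states = len(target)

# Build P[i][j]: from state j, propose one of the other states uniformly and
# accept the move with probability min(1, target[i] / target[j]).
P = [[0.0] * n_states for _ in range(n_states)]
for j in range(n_states):
    for i in range(n_states):
        if i != j:
            P[i][j] = min(1.0, target[i] / target[j]) / (n_states - 1)
    P[j][j] = 1.0 - sum(P[i][j] for i in range(n_states) if i != j)

def propagate(P, prob):
    """One step of the chain: Pr(X_i(t)) = sum_j P[i][j] * Pr(X_j(t-1))."""
    size = len(prob)
    return [sum(P[i][j] * prob[j] for j in range(size)) for i in range(size)]

# Detailed balance: target[i] * P[j][i] equals target[j] * P[i][j] for all pairs
balance_gap = max(abs(target[i] * P[j][i] - target[j] * P[i][j])
                  for i in range(n_states) for j in range(n_states))

# Ergodicity: start with all probability in state 0 and iterate
prob = [1.0, 0.0, 0.0]
for _ in range(100):
    prob = propagate(P, prob)
convergence_gap = max(abs(a - b) for a, b in zip(prob, target))
```

Both gaps should be at round-off level: the target is a fixed point of the propagation step, and the iterated distribution forgets its starting point.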

Thus, an MCMC method for the simulation of a distribution is any technique producing
an ergodic or reversible chain whose invariant distribution is the desired
target distribution. Armed with this information we can now discuss the most powerful
and efficient MCMC methods: the Metropolis-Hastings and Gibbs samplers and
their variants.


3.4.2 Metropolis-Hastings Sampling 

Markov chain simulation is essentially a general technique based on generating samples
from proposal distributions and then correcting (acceptance or rejection) those
samples to approximate a target posterior distribution. Here we must know both the
target, $p_X(x)$ (up to a normalizing constant), and the proposal, q(x), a priori. The samples
are sequentially generated forming a Markov chain with properties defined in the
previous section. Typically, in Markov chain simulation, samples are generated from
the transition kernel or distribution. The key, however, is not really the chain itself,
but the fact that the approximate distribution improves sequentially as it converges to
the target posterior.

In this subsection we discuss the basic Metropolis-Hastings sampling method.
We start with the original Metropolis algorithm [24] and then introduce the Hastings
generalization [23]. The fundamental idea is similar to the rejection method discussed
previously. The Metropolis-Hastings (M-H) technique defines a Markov chain such
that a new sample, $x_i$, is generated from the previous sample, $x_{i-1}$, by first drawing a
“candidate” sample, $\hat{x}_i$, from a proposal distribution, q(x), and then making a decision
whether this candidate should be accepted and retained or rejected and discarded
using the previous sample as the new. If accepted, $\hat{x}_i$ replaces $x_i$ ($\hat{x}_i \rightarrow x_i$); otherwise
the old sample $x_{i-1}$ is saved ($x_{i-1} \rightarrow x_i$). This is the heart of the M-H approach in its
simplest form.

We start with the basic Metropolis technique to describe the method:

• Initialize: $x_0 \rightarrow p_X(x_0)$

• Generate a candidate sample from proposal: $\hat{x}_i \rightarrow q(x)$

• Calculate the acceptance probability: $\mathcal{A}(x_{i-1}, \hat{x}_i) = \min\left\{\dfrac{p_X(\hat{x}_i)}{p_X(x_{i-1})}, 1\right\}$

• ACCEPT candidate sample with probability, $\mathcal{A}(x_{i-1}, \hat{x}_i)$, according to:

$$x_i = \begin{cases} \hat{x}_i & \text{if } p_X(\hat{x}_i) > p_X(x_{i-1}) \\ x_{i-1} & \text{otherwise} \end{cases} \tag{3.32}$$

We see from this technique that when the candidate sample probability is greater
than the previous sample’s probability it is accepted with probability

$$p_X(x_i) = \begin{cases} \mathcal{A}(x_i, \hat{x}_i) & \text{if Accepted} \\ 1 - \mathcal{A}(x_i, \hat{x}_i) & \text{if Rejected} \end{cases} \tag{3.33}$$


The idea is that we can detect when the chain has converged to its invariant distribution
(posterior), when $p_X(x_i) = p_X(x_{i-1})$. It is clear from this discussion that in order
for the chain to converge to the posterior it must be reversible and therefore q(x) must
be a symmetric distribution. This was the original Metropolis assumption. Hastings
[23] generalized this technique by removing the symmetry constraint enabling asymmetric
proposals. The basic algorithm remains the same except that the acceptance
probability becomes ($p_X \rightarrow p$)

$$\mathcal{A}(x_i, \hat{x}_i) = \min\left\{\frac{p(\hat{x}_i) \times q(x_i \mid \hat{x}_i)}{p(x_i) \times q(\hat{x}_i \mid x_i)}, 1\right\} \tag{3.34}$$

This process continues until the desired N samples of the Markov chain have been
generated. The critical step required to show that the M-H converges to the invariant
distribution or equivalently the posterior, $p_X(x)$, evolves directly from the detailed
balance of Eq. 3.30 given by

$$p(x_{i+1} \mid x_i) \times p(x_i) = p(x_i \mid x_{i+1}) \times p(x_{i+1}) \tag{3.35}$$

where $p(x_{i+1} \mid x_i)$ is the Markov chain transition probability. If we assume that the
$i$th sample was generated from the posterior, $x_i \sim p_X(x)$, then it is also assumed that
the chain has converged and all subsequent samples have the same posterior. Thus,
we must show that the next sample, $x_{i+1}$, is also distributed $p(x_i) = p_X(x)$. Starting
with the detailed balance definition above and integrating (summing) both sides with
respect to $x_i$, it follows that

$$\int p(x_{i+1} \mid x_i)\, p(x_i)\, dx_i = \int p(x_i \mid x_{i+1})\, p(x_{i+1})\, dx_i = p(x_{i+1}) \int p(x_i \mid x_{i+1})\, dx_i = p(x_{i+1}) \tag{3.36}$$

which shows that the relation on the left side of Eq. 3.36 gives the marginal distribution
of $x_{i+1}$ assuming $x_i$ is from $p(x_i)$. This implies that if the assumption that $x_i$ is from
$p(x_i)$ is true, then $x_{i+1}$ must also be from $p(x_i)$ and therefore $p(x_i) \to p_X(x)$ is the
invariant distribution of the chain. Thus, once a sample is obtained from the invariant
distribution, all subsequent samples will be from that distribution as well, proving
that the invariant distribution is $p(x_i)$. A full proof of the M-H technique requires a
proof that $p(x_i|x_0)$ will converge to the invariant distribution (see [19] for details).
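For the M-H technique, the transition from $x_i$ to $x_{i+1} \neq x_i$ occurs with probability

$$p(x_{i+1}|x_i) = q(x_{i+1}|x_i) \times \min\left\{ \frac{p(x_{i+1}) \times q(x_i|x_{i+1})}{p(x_i) \times q(x_{i+1}|x_i)},\ 1 \right\} = A(x_i,x_{i+1}) \times q(x_{i+1}|x_i) \tag{3.37}$$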

This completes the basic M-H theory. 
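
The detailed balance and invariance arguments of Eqs. 3.35 and 3.36 can be checked numerically on a small discrete chain. The following Python sketch is only an illustration: the three-state target `p`, the asymmetric proposal matrix `q` and the function name `mh_transition_matrix` are assumptions made here, not part of the text.

```python
def mh_transition_matrix(p, q):
    """Build the M-H transition matrix for a discrete target and proposal.

    p[i]    : target probability of state i (hypothetical example values)
    q[i][j] : probability of proposing state j when the chain is in state i
    """
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j and q[i][j] > 0.0:
                # Acceptance probability in the form of Eq. 3.34 for the move i -> j
                accept = min(1.0, (p[j] * q[j][i]) / (p[i] * q[i][j]))
                T[i][j] = q[i][j] * accept
        # Rejected candidates leave the chain in its current state
        T[i][i] = 1.0 - sum(T[i])
    return T


if __name__ == "__main__":
    p = [0.2, 0.5, 0.3]                 # assumed target distribution
    q = [[0.1, 0.6, 0.3],               # assumed asymmetric proposal
         [0.3, 0.2, 0.5],
         [0.4, 0.4, 0.2]]
    T = mh_transition_matrix(p, q)
    # Detailed balance: p_i T_ij should equal p_j T_ji for every pair (i, j)
    balanced = all(abs(p[i] * T[i][j] - p[j] * T[j][i]) < 1e-12
                   for i in range(3) for j in range(3))
    # Invariance: one step of the chain started from p should return p
    next_p = [sum(p[i] * T[i][j] for i in range(3)) for j in range(3)]
    print("detailed balance holds:", balanced)
    print("distribution after one step:", next_p)
```

Summing the detailed balance relation over the current state is exactly the step taken in Eq. 3.36, which is why the one-step distribution reproduces the target.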


3.4.3 Random Walk Metropolis-Hastings Sampling 

Next we consider another version of the M-H technique based on a random walk
search through the parameter space. The idea is to perturb the current sample, $x_i$, with
an addition of a random error, that is,

$$x_{i+1} = x_i + \epsilon_i \quad \text{for} \quad \epsilon_i \sim p_\epsilon(\epsilon) \tag{3.38}$$

where $\epsilon$ is i.i.d. A reasonable choice for this distribution is a symmetric Gaussian,
that is, $p_\epsilon(\epsilon) \sim \mathcal{N}(0, \sigma_\epsilon^2)$. Thus, the random walk M-H method is:

Given the current sample, $x_i$,

• Draw a random sample: $\epsilon_i \to p_\epsilon(\epsilon)$

• Generate the candidate sample: $\hat{x}_i = x_i + \epsilon_i$

• Draw a uniform random sample: $u_i \to \mathcal{U}(0,1)$

• Calculate the acceptance probability from the known densities: $A(x_i,\hat{x}_i)$

• Update the sample:

$$x_{i+1} = \begin{cases} \hat{x}_i & \text{if } u_i < A(x_i,\hat{x}_i) \\ x_i & \text{otherwise} \end{cases} \tag{3.39}$$


• Select the next sample 
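
The random walk steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the step standard deviation and the unit Gaussian target are choices made here for demonstration and are not taken from the text.

```python
import math
import random


def random_walk_mh(log_target, x0, n_samples, step_std, seed=0):
    """Random walk Metropolis-Hastings with a zero-mean Gaussian perturbation.

    log_target : log of the (possibly unnormalized) target density
    step_std   : standard deviation of the random walk step (a tuning choice)
    Returns the list of samples and the fraction of accepted candidates.
    """
    rng = random.Random(seed)
    x = x0
    samples = []
    n_accepted = 0
    for _ in range(n_samples):
        eps = rng.gauss(0.0, step_std)      # draw the random perturbation
        candidate = x + eps                 # generate the candidate sample
        u = rng.random()                    # draw the uniform sample
        # The Gaussian step is symmetric, so the proposal terms cancel and the
        # acceptance probability reduces to min(1, p(candidate) / p(x)).
        log_ratio = log_target(candidate) - log_target(x)
        accept_prob = math.exp(min(0.0, log_ratio))
        if u < accept_prob:                 # update the sample
            x = candidate
            n_accepted += 1
        samples.append(x)                   # a rejected candidate repeats x
    return samples, n_accepted / n_samples


def log_unit_gaussian(x):
    """Log of an unnormalized N(0, 1) density (assumed example target)."""
    return -0.5 * x * x


if __name__ == "__main__":
    samples, rate = random_walk_mh(log_unit_gaussian, 0.0, 20000, 2.0, seed=1)
    mean = sum(samples) / len(samples)
    print("acceptance rate:", rate, "sample mean:", mean)
```

Working with the log of the target avoids numerical underflow in the tails and means the normalizing constant of the target is never needed.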


With this algorithm, we must use both the (known) proposal and target distributions
to calculate the acceptance probability and then generate samples (random
walk) from the proposal. It is important to realize that a “good” proposal distribution
can assure generating samples from the desired target distribution, but the
samples must still be generated to “cover” its range. This is illustrated in Fig. 3.5
where our target distribution is a Gaussian mixture with mixing coefficients and components:
$(0.3, \mathcal{N}(0, 2.5);\ 0.5, \mathcal{N}(5, 1);\ 0.2, \mathcal{N}(10, 2.5))$. In the figure we see the results from
choosing a reasonable proposal ($\mathcal{N}(0, 100)$), shown in the dark color, which generates enough samples
to cover the range of the target distribution, and a proposal ($\mathcal{N}(0, 10)$) that does not
sample the entire space adequately, leading to an erroneous estimate of the target distribution
at the sampler output. Consider the following example of applying a variety
of proposals and M-H techniques.
FIGURE 3.5 Metropolis-Hastings for Gaussian mixture distribution PDF estimates: (a) Inadequate
proposal: $\mathcal{N}(0, 10)$ at a 59.2% acceptance rate (light gray). (b) Adequate
proposal: $\mathcal{N}(0, 100)$ at a 34.0% acceptance rate (dark gray).


Example 3.9 

Suppose we would like to generate samples from a unit Gaussian distribution,
$\mathcal{N}(0, 1)$, using the various M-H simulation techniques: (i) symmetric proposal using
$\mathcal{U}(-10, 10)$; (ii) symmetric proposal using a Student T distribution, $\mathcal{T}(1)$; (iii) random
walk using uniform noise, $\mathcal{U}(-5, 5)$; and (iv) uniform proposal using $\mathcal{U}(-4, 9)$.
To start we specify the “target” distribution as $p_X(x) \sim \mathcal{N}(0, 1)$ and we choose the
following proposals:

• Case (i): $q_1(x) \sim \mathcal{U}(-10, 10) \Rightarrow \frac{1}{20}$ for $-10 < x < 10$

• Case (ii): $q_2(x) \sim \mathcal{T}(1)$

• Case (iii): $x_i = x_{i-1} + u_i$ where $u_i \sim \mathcal{U}(-5, 5) \Rightarrow \frac{1}{10}$ for $-5 < u_i < 5$

• Case (iv): $q_4(x) \sim \mathcal{U}(-4, 9) \Rightarrow \frac{1}{13}$ for $-4 < x < 9$

To implement the Metropolis, Metropolis-Hastings and Random Walk Metropolis-Hastings
techniques we must:

1. Draw a random sample from the proposal: $x_i \to q(x)$

2. Calculate the Acceptance Ratio: $A(x_i, x_{i-1})$

3. Draw a uniform sample: $u_i \to \mathcal{U}(0, 1)$

4. Accept or reject sample: $u_i < A(x_i, x_{i-1})$

5. Update sample: $x_i$

6. Generate next sample: $x_{i+1}$

The results of these $10^5$-sample simulations are shown in Fig. 3.6a-d using the
M-H sampler in MATLAB. We see the results of using the various proposals. All of the
estimated densities give reasonably close estimates of the target posterior, $\mathcal{N}(0, 1)$.
We estimated the corresponding posterior distribution from the samples using both
histogram and kernel density estimators of Sec. 3.2 with all of the results reasonable.
The standard deviation estimates were very close to unity in all cases; however,
the means differed slightly from zero. It is interesting to note the corresponding
acceptance rates (see the figure caption): the most Gaussian-like proposal, the T
distribution, had the highest acceptance rate of 57.8%. The true target distribution
is superimposed for comparison purposes. Acceptance rates provide an indication
of how “probable” it is that a sample from the proposal is accepted as a sample from the target
(posterior). When the proposal provides good coverage of the target distribution,
samples are accepted at a high rate; if not, the rate is low. Clearly, the
M-H sampling technique and its variants provide a very robust method for generating
samples from a target distribution, especially when the proposal covers the entire range
of the target sample space. △△△
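
Cases (i) and (iv) of the example can be sketched as an independence-type M-H sampler, in which every candidate is drawn from a fixed uniform proposal. The Python below is a hedged illustration, not the MATLAB implementation used for the figures; the helper names (`independence_mh`, `uniform_proposal`) and the starting value are assumptions.

```python
import math
import random


def independence_mh(log_target, draw_proposal, log_proposal, x0, n_samples, seed=0):
    """M-H sampler whose candidates come from a fixed proposal q(x)."""
    rng = random.Random(seed)
    x = x0
    samples = []
    n_accepted = 0
    for _ in range(n_samples):
        candidate = draw_proposal(rng)
        # Log of [p(candidate) q(x)] / [p(x) q(candidate)]
        log_ratio = (log_target(candidate) + log_proposal(x)
                     - log_target(x) - log_proposal(candidate))
        if rng.random() < math.exp(min(0.0, log_ratio)):
            x = candidate
            n_accepted += 1
        samples.append(x)
    return samples, n_accepted / n_samples


def uniform_proposal(low, high):
    """Return (draw, log_density) functions for a U(low, high) proposal."""
    log_q = -math.log(high - low)

    def draw(rng):
        return rng.uniform(low, high)

    def log_density(x):
        return log_q if low <= x <= high else float("-inf")

    return draw, log_density


def log_unit_gaussian(x):
    """Log of an unnormalized N(0, 1) target density."""
    return -0.5 * x * x


if __name__ == "__main__":
    for low, high in [(-10.0, 10.0), (-4.0, 9.0)]:   # Cases (i) and (iv)
        draw, log_q = uniform_proposal(low, high)
        samples, rate = independence_mh(log_unit_gaussian, draw, log_q,
                                        0.0, 40000, seed=2)
        mean = sum(samples) / len(samples)
        print("U(%g, %g): acceptance %.3f, mean %.3f" % (low, high, rate, mean))
```

Because a uniform proposal has constant density on its support, the proposal terms cancel in the ratio and only the target values at the candidate and current samples matter.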

There are a variety of M-H sampling techniques such as the independence sampler,
the hybrid or dynamic (Hamiltonian) sampler, the multipoint samplers, etc. [32].
Also note that many of these methods are available in freeware (e.g. language-based
BUGS [38] and the MATLAB-based NETLAB [39] and PRTools [40]). We summarize the
M-H sampling technique in Table 3.1. Next we discuss a popular special case of this
approach, the Gibbs sampler.

3.4.4 Gibbs Sampling 

The Gibbs simulation-based sampler (G-S) is one of the most flexible of the sampling
techniques available. It is a special case of the Metropolis-Hastings approach in which
the acceptance probability, $A(x_i,\hat{x}_i)$, is unity, that is, all samples are accepted [27].
Theoretically, the G-S is based on the fact that the targeted joint posterior distribution
can be determined completely by a set of underlying conditional distributions evolving
directly from Bayes’ rule (joint = conditional × marginal) [29]. It falls into the class
of sampling algorithms termed block-at-a-time or variable-at-a-time methods [23].
The proof of these methods has a significant practical implication, since it can be shown
that the transition distribution of the Markov chain is a product of
conditional transitions which converges to the joint posterior as its invariant distribution
[30]. It is for this reason that the G-S is called “sampling-by-conditioning” [28], which
will become obvious after we investigate its underlying structure. As before, it should
be realized that both target and proposal distributions must be known (approximately)
and samples must be easily generated from the proposal to be effective.

Gibbs sampling can be considered an implementation of the M-H technique on
a component-by-component basis of a random vector. It is more restrictive than the
M-H method, but can be more efficient, leading to a simpler implementation. The
G-S is especially important in Bayesian problems, since it is uniquely designed for
multivariate problems, that is, it replaces sampling from a high-dimensional vector
with sampling from low order component blocks [67]. It can be considered a concatenation
of M-H samplers, one for each component variable of the random vector. This
decomposition has individual target distributions representing a conditional density
or mass for each component given values for all of the other variables. Thus, the
proposal for the component variable of the vector is the conditional density of that
variable given the most recent values for all of the others.

TABLE 3.1 Metropolis-Hastings Sampling Algorithm

Initialize
    $x_0 \to p_X(x_0)$    [draw sample]

Proposal
    $\hat{x}_i \to q(x)$    [draw sample]

Acceptance probability
    $A(x_i,\hat{x}_i) = \min\left\{ \frac{p(\hat{x}_i) \times q(x_i|\hat{x}_i)}{p(x_i) \times q(\hat{x}_i|x_i)},\ 1 \right\}$

Uniform sample
    $u_k \to \mathcal{U}(0,1)$    [draw sample]

Decision
    $x_i = \hat{x}_i$ if $u_k < A(x_i,\hat{x}_i)$; $x_i = x_{i-1}$ otherwise

Next sample
    [draw sample]

More formally, suppose the random vector, $X \in \mathcal{R}^{N_x \times 1}$, is decomposed into its
components, $X_k$ for $k = 1, \ldots, N_x$. The idea is to generate, say $X_1(i)$, based
on the conditional distribution, $\Pr(X_1 | X_2(i-1), \ldots, X_{N_x}(i-1), Y)$, and the next sample
drawn, $X_2(i)$, uses it and the samples available from the previous iteration to
sample from $\Pr(X_2 | X_1(i) \cup X_3(i-1), \ldots, X_{N_x}(i-1), Y)$ and so forth, so that at the
$i$-th iteration we have the $k$-th component sample generated from

$$X_k(i) \to \Pr\big(X_k \,\big|\, \{X_n(i)\} \cup \{X_m(i-1)\},\ m > k,\ n < k,\ Y\big) \tag{3.40}$$

If we expand this relation, then we observe the underlying structure of the Gibbs 
sampler. 

Given the sample set, $\{X(i-1)\}$, then

• Generate the first sample: $X_1(i) \to \Pr(X_1 | X_2(i-1), \ldots, X_{N_x}(i-1), Y)$ and then

• Generate the next sample: $X_2(i) \to \Pr(X_2 | X_1(i) \cup X_3(i-1), \ldots, X_{N_x}(i-1), Y)$

• Generate the $k$-th sample: $X_k(i) \to \Pr(X_k | \{X_1(i), \ldots, X_{k-1}(i)\}, \{X_{k+1}(i-1), \ldots, X_{N_x}(i-1)\}, Y)$
So we see that the vector is decomposed component-wise and the corresponding
conditional distributions evolve, creating the vector sequence of iterates which are the
realization of a Markov chain with transition distribution. We assume that we would
like to go from $X' \to X$, giving the transition probability:

$$\Pr(X', X) = \Pr(X_1 | X'_2, \ldots, X'_{N_x}, Y) \times \Pr(X_2 | X_1, X'_3, \ldots, X'_{N_x}, Y) \times \cdots \times \Pr(X_{N_x} | X_1, \ldots, X_{N_x - 1}, Y) \tag{3.41}$$

The G-S can be shown to satisfy the detailed balance. As a result it converges to
the invariant distribution of the Markov chain which in this case is the joint posterior
distribution [28, 32]. Consider the following example from [20].

Example 3.10 

Suppose we have a measurement vector, $y$, from a bivariate Gaussian with unknown
mean and known covariance, that is,

$$\Pr(X|Y) = \mathcal{N}(Y, R_{xx}) \quad \text{for} \quad R_{xx} = \begin{bmatrix} 1 & \rho \\ \rho & 1 \end{bmatrix} \tag{3.42}$$

To apply the G-S to $X$, we require the conditional posterior from the well-known
properties of the bivariate Gaussian [1] given by

$$\Pr(X_1 | X_2, Y) \sim \mathcal{N}(y_1 + \rho(x_2 - y_2),\ 1 - \rho^2)$$

$$\Pr(X_2 | X_1, Y) \sim \mathcal{N}(y_2 + \rho(x_1 - y_1),\ 1 - \rho^2)$$

Thus, the G-S proceeds by alternating samples from these Gaussian distributions,
given $(x_1(0), x_2(0))$.

Let $i = 1, 2, \ldots$

• Draw $x_1(i) \to \Pr(x_1 | x_2(i-1), Y)$

• Draw $x_2(i) \to \Pr(x_2 | x_1(i), Y)$


So we generate the pairs $(x_1(1), x_2(1)), (x_1(2), x_2(2)), \ldots, (x_1(i), x_2(i))$ from the
sampling distribution converging to the joint bivariate Gaussian, $\Pr(X)$, as the invariant
distribution of the Markov chain. △△△
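
The alternating draws of this example translate directly into code. The sketch below assumes hypothetical values for the measurement, $y = (1, -2)$, and the correlation, $\rho = 0.8$; the function name and the zero starting point are likewise illustrative choices.

```python
import math
import random


def gibbs_bivariate_gaussian(y, rho, n_samples, x_init=(0.0, 0.0), seed=0):
    """Gibbs sampler for a bivariate Gaussian with mean y and covariance
    [[1, rho], [rho, 1]], alternating the two conditional draws."""
    rng = random.Random(seed)
    y1, y2 = y
    x1, x2 = x_init
    cond_std = math.sqrt(1.0 - rho * rho)    # conditional standard deviation
    samples = []
    for _ in range(n_samples):
        # Draw x1 given the most recent x2, then x2 given the new x1
        x1 = rng.gauss(y1 + rho * (x2 - y2), cond_std)
        x2 = rng.gauss(y2 + rho * (x1 - y1), cond_std)
        samples.append((x1, x2))
    return samples


if __name__ == "__main__":
    chain = gibbs_bivariate_gaussian((1.0, -2.0), 0.8, 20000, seed=4)
    kept = chain[1000:]                      # discard an assumed burn-in
    m1 = sum(s[0] for s in kept) / len(kept)
    m2 = sum(s[1] for s in kept) / len(kept)
    print("estimated mean:", (m1, m2))
```

Every draw is accepted, which is the unit acceptance probability noted earlier for the G-S; the price is that successive pairs are correlated, more strongly so as $|\rho|$ approaches one.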


Next we consider a generalized variant of the Gibbs sampler, the slice sampler.

3.4.5 Slice Sampling 

Slice sampling (S-S) is an MCMC sampling technique based on the premise of sampling
uniformly from the region under the target distribution, $\Pr(X)$ [19, 31, 32, 36, 37].
Therefore, like all of the previous sampling methods, it can be applied to any problem
for which the target distribution can be evaluated at a point, say $x$. It has an advantage
over Metropolis methods, being more robust to step-size variations, especially since
it performs adaptive adjustments. Slice sampling is a generalized case of the
Gibbs sampler based on iterative simulations of the uniform distribution consisting
of one dimensional transitions. It is also similar to rejection sampling in that it draws
samples from under the target distribution. However, in all of these cases the S-S is
not bound by their restrictions [37]. It is based on the following proposition proven in
Cappe [36].

Proposition: Let $x \sim \Pr(X)$ and $u \sim \mathcal{U}(0, M)$, then the pair of random variables,
$(x, u)$, is distributed uniformly as $(x, u) \sim \mathcal{U}(0, M \times \Pr(X))$. Conversely, if $(x, u)$
is uniformly distributed, then $x$ admits $\Pr(X)$ as its marginal distribution.

In its simplest form, the slice sampler technique generates uniform intervals that
capture the samples:

• Initialize: $x_{i-1}$

• Draw uniform samples: $u_i \to \mathcal{U}(0, \Pr(x_{i-1}))$

• Draw uniform samples: $x_i \to \mathcal{U}(0, S(u_i))$ where $S(u_i) = \{x : \Pr(x) > u_i\}$

The actual implementation of the algorithm is much more complex and we refer 
the interested reader to Neal [31] or MacKay [37] for more details. We merely state 
important features of the slice sampler technique. 

The S-S involves establishing intervals to ensure the sample points of the target
distribution are included by using an adaptive step-size (interval size) applying two
techniques: (1) step-out technique; and (2) shrinking technique. The step-out technique
is used to increase the size of the interval until the new sample, $x_i$, is included,
while the shrinking technique does the opposite: it decreases the interval size to assure
the original sample, $x_{i-1}$, is included. Consider the following example from [19] to
illustrate the S-S.

Example 3.11 

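Suppose we would like to generate samples from a unit Gaussian distribution,
$x \sim \mathcal{N}(0, 1)$, using the slice sampling technique:

• Initialize: $x_{i-1} = 0$ (random draw)

• Draw uniform samples: $u_i \to \mathcal{U}(0, \Pr(x_{i-1})) = \mathcal{U}\left(0, \frac{1}{\sqrt{2\pi}} e^{-x_{i-1}^2/2}\right)$

• Draw uniform samples: $x_i \to \mathcal{U}(-\alpha, \alpha)$ where $\alpha = \sqrt{-2\ln\left(\sqrt{2\pi}\, u_i\right)}$, the half-width of the slice $\{x : \Pr(x) > u_i\}$

Simulations can now be performed and analyzed. △△△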
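
For the unit Gaussian the slice has a closed form, so the two uniform draws can be coded directly without the step-out and shrinking procedures. The following Python sketch makes that simplification explicit; the function name and the half-width $\alpha = \sqrt{-2\ln(\sqrt{2\pi}\,u_i)}$, obtained by solving $\Pr(x) > u_i$ for $x$, are stated here as working assumptions.

```python
import math
import random


def slice_sample_unit_gaussian(n_samples, x0=0.0, seed=0):
    """Slice sampler for N(0, 1) using the closed-form slice interval."""
    rng = random.Random(seed)
    x = x0
    samples = []
    for _ in range(n_samples):
        density = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        # Vertical draw: u is uniform on (0, density]; 1 - random() avoids u = 0
        u = density * (1.0 - rng.random())
        # Horizontal draw: the slice {x : p(x) > u} is the interval (-alpha, alpha)
        arg = -2.0 * math.log(math.sqrt(2.0 * math.pi) * u)
        alpha = math.sqrt(max(0.0, arg))     # guard against round-off below zero
        x = rng.uniform(-alpha, alpha)
        samples.append(x)
    return samples


if __name__ == "__main__":
    draws = slice_sample_unit_gaussian(20000, seed=5)
    mean = sum(draws) / len(draws)
    var = sum((d - mean) ** 2 for d in draws) / len(draws)
    print("sample mean:", mean, "sample variance:", var)
```

For targets whose slice cannot be found analytically, the interval must instead be located adaptively, which is where the step-out and shrinking techniques enter.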

We conclude this section with a signal processing example. We are given
an autoregressive (all-pole) model and we would like to generate samples from the
corresponding posterior distribution.



FIGURE 3.7 Gibbs (slice) sampler for autoregressive (AR(1)) model: (a) Simulated samples
($N = 10^5$). (b) Estimated posterior distribution: $\mathcal{N}(0.002, 0.1003)$ with 10% burn-in
($N = 10^4$ samples). [Axes: Probability, Amplitude, Samples]


Example 3.12 

We have the following all-pole (AR(1)) model:

$$x(t) = -a\,x(t-1) + w(t-1) \quad \text{for} \quad w \sim \mathcal{N}(0, R_{ww})$$

Suppose we choose the following parameters for the model: $N = 10^5$ samples,
$a = 0.1$, $R_{ww} = 0.1$ and we would like to develop samples from the posterior. Analytically,
we know that the posterior distribution is given by:
$x \sim \mathcal{N}\left(0, \frac{R_{ww}}{1 - a^2}\right) = \mathcal{N}(0, 0.101)$. We generate the samples using the slice (Gibbs) sampler with the
proposal: $q(x) \sim \mathcal{N}(0, 0.1)$. The results are shown in Fig. 3.7. Using MATLAB we
synthesized the set of samples and estimated their underlying distribution using both
histogram and kernel density (Gaussian window) estimators. The samples, including
a 10% burn-in period, are shown in Fig. 3.7a along with the estimated distribution
in b. The sampler has a 100% acceptance rate and the estimates are quite good with
the posterior estimated at $\mathcal{N}(0.002, 0.1003)$. △△△
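
The analytic variance quoted in this example, $R_{ww}/(1 - a^2) \approx 0.101$, can be cross-checked by simulating the AR(1) recursion directly. Note that the sketch below is a plain forward simulation and not the slice sampler used for Fig. 3.7; the function name and the choice to start from the stationary distribution are assumptions.

```python
import math
import random


def simulate_ar1(a, r_ww, n_samples, seed=0):
    """Simulate x(t) = -a x(t-1) + w(t-1) with w ~ N(0, r_ww)."""
    rng = random.Random(seed)
    noise_std = math.sqrt(r_ww)
    # Assumed initialization: draw x(0) from the stationary distribution
    x = rng.gauss(0.0, math.sqrt(r_ww / (1.0 - a * a)))
    samples = []
    for _ in range(n_samples):
        x = -a * x + rng.gauss(0.0, noise_std)
        samples.append(x)
    return samples


if __name__ == "__main__":
    a, r_ww = 0.1, 0.1
    xs = simulate_ar1(a, r_ww, 100000, seed=6)
    mean = sum(xs) / len(xs)
    var = sum((v - mean) ** 2 for v in xs) / len(xs)
    print("analytic variance:", r_ww / (1.0 - a * a))
    print("sample mean:", mean, "sample variance:", var)
```

A sample variance near 0.101 from this simulation gives an independent reference against which the sampler output can be compared.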

This concludes the section on sampling theory and iterative sampling techniques. 
Next we investigate the importance sampler that will lead to the sequential approaches 
required to construct Bayesian model-based processors. 
3.5 IMPORTANCE SAMPLING

One way to mitigate difficulties with the inability to directly sample from a posterior 
distribution is based on the concept of importance sampling. Importance sampling 
is a method to compute expectations with respect to one distribution using random 
samples drawn from another. That is, it is a method of simulating samples from a 
proposal distribution to be used to approximate a targeted (posterior) distribution by 
appropriate weighting. Importance sampling is a generalization of the MC approach 
which evolves by rewriting Eq. 3.27 as:

$$I = \int f(x)\,dx = \int \left(\frac{f(x)}{q(x)}\right) \times q(x)\,dx \quad \text{for} \quad \int q(x)\,dx = 1 \tag{3.43}$$

Here $q(x)$ is referred to as the sampling distribution or more appropriately the
importance sampling distribution, since it samples the target distribution, $f(x)$, non-uniformly,
giving “more importance” to some values of $f(x)$ than others. We say that
the support of $q(x)$ covers that of $f(x)$, or the samples drawn from $q(\cdot)$ overlap the
same region (or more) corresponding to the samples of $f(\cdot)$ as illustrated previously
in Fig. 3.3. That is, we say that $f(x)$ and $q(x)$ have the same support if

$$f(x) > 0 \Rightarrow q(x) > 0 \quad \forall x \in \mathcal{R}^{N_x \times 1}$$

a necessary condition for importance sampling to hold. If we interpret the prior of
Fig. 1.1 as the proposal, $q(x)$, and the posterior as the target, $f(x)$, then this figure
provides a visual example of coverage.

The integral in Eq. 3.43 can be estimated by:

• drawing $N$-samples from $q(x)$: $X(i) \sim q(x)$ and $\hat{q}(x) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x - X(i))$; and

• computing the sample mean [28],

$$I = E_q\left\{\frac{f(x)}{q(x)}\right\} \approx \frac{1}{N}\sum_{i=1}^{N} \frac{f(X(i))}{q(X(i))}$$

It is interesting to note that the MC approach provides an unbiased estimator with the
corresponding error variance easily calculated from the above relation.

Consider the case where we would like to estimate the expectation of the function
of $X$ given by $f(X)$. Then choosing an importance distribution, $q(x)$, that is similar to
$f(x)$ with covering support gives the expectation estimator

$$E_p\{f(x)\} = \int f(x) \times p(x)\,dx = \int f(x) \left(\frac{p(x)}{q(x)}\right) \times q(x)\,dx \tag{3.44}$$

If we draw samples, $\{X(i)\}$, $i = 1, \ldots, N$, from the importance distribution, $q(x)$, and
compute the sample mean, then we obtain the importance sampling estimator. That
is, assume the perfect sampler, $\hat{q}(x) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(x - X(i))$, and substitute

$$E_p\{f(x)\} = \int f(x) \left(\frac{p(x)}{q(x)}\right) \times \hat{q}(x)\,dx \approx \frac{1}{N}\sum_{i=1}^{N} f(X(i)) \times \left(\frac{p(X(i))}{q(X(i))}\right) \tag{3.45}$$

demonstrating the concept.
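
As a small numerical illustration of the estimator in Eq. 3.45, the Python sketch below estimates $E_p\{x^2\}$ for a unit Gaussian target using samples from a wider Gaussian proposal. The specific target, proposal, test function and helper names are assumptions chosen so that the true answer (unity) is known.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def importance_estimate(f, target_pdf, proposal_pdf, draw_proposal, n_samples, seed=0):
    """Estimate E_p{f(x)} as the sample mean of f(X(i)) p(X(i)) / q(X(i))."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = draw_proposal(rng)                       # X(i) ~ q(x)
        total += f(x) * target_pdf(x) / proposal_pdf(x)
    return total / n_samples


if __name__ == "__main__":
    estimate = importance_estimate(
        f=lambda x: x * x,
        target_pdf=lambda x: gaussian_pdf(x, 0.0, 1.0),      # p(x) = N(0, 1)
        proposal_pdf=lambda x: gaussian_pdf(x, 0.0, 4.0),    # q(x) = N(0, 4)
        draw_proposal=lambda rng: rng.gauss(0.0, 2.0),
        n_samples=50000,
        seed=7,
    )
    print("importance sampling estimate of E{x^2}:", estimate)
```

The proposal here has heavier tails than the target, so the ratio $p(x)/q(x)$ stays bounded; reversing the roles would violate the spirit of the coverage condition and inflate the estimator variance.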

The “art” in importance sampling is in choosing the importance distribution, $q(\cdot)$,
that approximates the target distribution, $p(\cdot)$, as closely as possible. This is the principal
factor affecting performance of this approach, since variates must be drawn from
$q(x)$ that cover the target distribution. Using the concepts of importance sampling,
we can approximate the posterior distribution with a function on a finite discrete
support. Since it is usually not possible to sample directly from the posterior, we use
importance sampling coupled with an easy to sample proposal distribution, $q(X_t|Y_t)$;
this is the crucial choice and design step required in Bayesian importance sampling
methodology. Here $X_t = \{x(0), \ldots, x(t)\}$ represents the set of dynamic variables and
$Y_t = \{y(0), \ldots, y(t)\}$, the set of measured data. Therefore, starting with a function of
the set of variables, say $f(X_t)$, we would like to estimate its mean using the importance
concept. That is, using the MC approach, we would like to sample from this posterior
directly and then use sample statistics to perform the estimation. Therefore we insert
the proposal importance distribution, $q(X_t|Y_t)$, as before

$$\hat{f}(t) := E\{f(X_t)\} = \int f(X_t) \left[\frac{\Pr(X_t|Y_t)}{q(X_t|Y_t)}\right] \times q(X_t|Y_t)\,dX_t \tag{3.46}$$


Now applying Bayes’ rule to the posterior target distribution, and defining a weighting
function as

$$\tilde{W}(t) := \frac{\Pr(X_t|Y_t)}{q(X_t|Y_t)} = \frac{\Pr(Y_t|X_t) \times \Pr(X_t)}{\Pr(Y_t) \times q(X_t|Y_t)} \tag{3.47}$$

Unfortunately, $\tilde{W}(t)$ is not useful because it requires knowledge of the evidence or
normalizing constant $\Pr(Y_t)$ given by

$$\Pr(Y_t) = \int \Pr(Y_t|X_t) \times \Pr(X_t)\,dX_t \tag{3.48}$$

which is usually not available. But by substituting Eq. 3.47 into Eq. 3.46 and defining
a new weight, $W(t)$, as

$$W(t) \propto \frac{\Pr(X_t|Y_t)}{q(X_t|Y_t)}, \qquad W(t) = \frac{\Pr(Y_t|X_t) \times \Pr(X_t)}{q(X_t|Y_t)} \tag{3.49}$$
we obtain

$$\hat{f}(t) = \frac{1}{\Pr(Y_t)} \int f(X_t) \left[\frac{\Pr(Y_t|X_t)\,\Pr(X_t)}{q(X_t|Y_t)}\right] q(X_t|Y_t)\,dX_t = \frac{1}{\Pr(Y_t)} \int W(t)\, f(X_t)\, q(X_t|Y_t)\,dX_t \tag{3.50}$$

which is simply the expectation of the weighted function, $E_q\{W(t) f(X_t)\}$, scaled by the
normalizing constant. From this definition of the new weighting function in Eq. 3.49,
we have

$$W(t) \times q(X_t|Y_t) = \Pr(Y_t|X_t) \times \Pr(X_t) \tag{3.51}$$

Thus, we can now replace the troublesome normalizing constant of Eq. 3.48 using
Eq. 3.51, that is,

$$\hat{f}(t) = \frac{E_q\{W(t) f(X_t)\}}{\Pr(Y_t)} = \frac{E_q\{W(t) f(X_t)\}}{\int W(t) \times q(X_t|Y_t)\,dX_t} = \frac{E_q\{W(t) f(X_t)\}}{E_q\{W(t)\}} \tag{3.52}$$

Now drawing samples from the proposal, $X_t(i) \sim q(X_t|Y_t)$, and using the MC
approach leads to the desired result. That is, from the “perfect” sampling distribution,
we have that

$$\hat{q}(X_t|Y_t) \approx \frac{1}{N}\sum_{i=1}^{N} \delta(X_t - X_t(i)) \tag{3.53}$$

and therefore substituting, applying the sifting property of the Dirac delta function
and defining the “normalized” weights

$$\mathcal{W}_i(t) := \frac{W_i(t)}{\sum_{i=1}^{N} W_i(t)} \quad \text{for} \quad W_i(t) = \frac{\Pr(Y_t|X_t(i)) \times \Pr(X_t(i))}{q(X_t(i)|Y_t)} \tag{3.54}$$

we obtain the final estimate

$$\hat{f}(t) \approx \sum_{i=1}^{N} \mathcal{W}_i(t) \times f(X_t(i)) \tag{3.55}$$


The importance estimator is biased, being the ratio of two sample estimators (as in
Eq. 3.52), but it can be shown that it asymptotically converges to the true statistic
and the central limit theorem holds [32, 49]. Thus, as the number of samples increases
($N \to \infty$), an asymptotically optimal estimate of the posterior is

$$\hat{\Pr}(X_t|Y_t) \approx \sum_{i=1}^{N} \mathcal{W}_i(t) \times \delta(X_t - X_t(i)) \tag{3.56}$$

which is the goal of Bayesian estimation. Note that the new weight is $W(t) \propto \tilde{W}(t)$,
where $\propto$ is defined as “proportional to” up to a normalizing constant.
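
The normalized-weight estimator of Eqs. 3.54 and 3.55 can be illustrated on a scalar problem where the exact posterior is known. The sketch below assumes a prior $x \sim \mathcal{N}(0,1)$, a single measurement $y = x + v$ with $v \sim \mathcal{N}(0, R_{vv})$, and a proposal $\mathcal{N}(0, 4)$; these modeling choices and the helper names are hypothetical. For this model the exact posterior mean is $y/(1 + R_{vv})$.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def normalized_importance_estimate(f, likelihood, prior_pdf, proposal_pdf,
                                   draw_proposal, n_samples, seed=0):
    """Estimate E{f(x)|y} without evaluating the evidence Pr(y)."""
    rng = random.Random(seed)
    xs = [draw_proposal(rng) for _ in range(n_samples)]
    # Unnormalized weights: likelihood x prior / proposal
    w = [likelihood(x) * prior_pdf(x) / proposal_pdf(x) for x in xs]
    total = sum(w)
    norm_w = [wi / total for wi in w]                # normalized weights
    estimate = sum(nw * f(x) for nw, x in zip(norm_w, xs))
    return estimate, norm_w


if __name__ == "__main__":
    y, r_vv = 1.0, 0.5                               # assumed measurement and noise variance
    est, weights = normalized_importance_estimate(
        f=lambda x: x,
        likelihood=lambda x: gaussian_pdf(y, x, r_vv),
        prior_pdf=lambda x: gaussian_pdf(x, 0.0, 1.0),
        proposal_pdf=lambda x: gaussian_pdf(x, 0.0, 4.0),
        draw_proposal=lambda rng: rng.gauss(0.0, 2.0),
        n_samples=50000,
        seed=8,
    )
    print("estimated posterior mean:", est, "exact:", y / (1.0 + r_vv))
```

Dividing by the sum of the weights plays the role of the unknown evidence, which is why only the likelihood, prior and proposal need to be evaluated.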
3.6 SEQUENTIAL IMPORTANCE SAMPLING

Suppose we would like to develop a sequential version [41-60] of the batch Bayesian
importance sampling estimator of the previous section. The importance distribution
can be modified to enable a sequential estimation of the desired posterior distribution,
that is, we estimate the posterior, $\Pr(X_{t-1}|Y_{t-1})$, using importance weights, $W(t-1)$.
As a new sample becomes available, we estimate the new weight, $W(t)$, leading to an
updated estimate of the posterior, $\Pr(X_t|Y_t)$. This means that in order to obtain the new
set of samples, $X_t(i) \sim q(X_t|Y_t)$, sequentially, we must use the previous set of samples,
$X_{t-1}(i) \sim q(X_{t-1}|Y_{t-1})$. Thus, with this in mind, the importance distribution, $q(X_t|Y_t)$,
must admit a marginal distribution, $q(X_{t-1}|Y_{t-1})$, implying the following Bayesian
factorization

$$q(X_t|Y_t) = q(X_{t-1}|Y_{t-1}) \times q(x(t)|X_{t-1}, Y_t) \tag{3.57}$$

which satisfies the probability chain rule decomposition

$$q(X_t|Y_t) = \prod_{k=0}^{t} q(x(k)|X_{k-1}, Y_k) \tag{3.58}$$


Now let us see how this type of importance distribution choice can lead to the
desired sequential solution [58]. We start with the definition of the unnormalized
weight of Eq. 3.49 and substitute Eq. 3.57 while applying Bayes’ rule to the
numerator. The resulting weighting function is

$$W(t) = \frac{\Pr(Y_t|X_t) \times \Pr(X_t)}{q(X_{t-1}|Y_{t-1}) \times q(x(t)|X_{t-1}, Y_t)} \tag{3.59}$$

Motivated by the definition of $W(t-1)$, we multiply and divide by the Bayesian
factor $\Pr(Y_{t-1}|X_{t-1}) \times \Pr(X_{t-1})$ and group to obtain

$$W(t) = \underbrace{\left[\frac{\Pr(Y_{t-1}|X_{t-1}) \times \Pr(X_{t-1})}{q(X_{t-1}|Y_{t-1})}\right]}_{\text{Previous Weight}} \times \frac{\Pr(Y_t|X_t) \times \Pr(X_t)}{\left[\Pr(Y_{t-1}|X_{t-1}) \times \Pr(X_{t-1})\right] \times q(x(t)|X_{t-1}, Y_t)}$$

Therefore we can write

$$W(t) = W(t-1) \times \frac{\Pr(Y_t|X_t) \times \Pr(X_t)}{\Pr(Y_{t-1}|X_{t-1}) \times \Pr(X_{t-1}) \times q(x(t)|X_{t-1}, Y_t)} \tag{3.60}$$


Using the probabilistic chain rule of Eq. 2.63 for each of the conditionals and
imposing the Markov property of the dynamic variable along with the conditional
independence conditions of the measurements, we obtain

$$\Pr(Y_t|X_t) = \prod_{k=0}^{t} \Pr(y(k)|x(k)) = \Pr(y(t)|x(t)) \prod_{k=0}^{t-1} \Pr(y(k)|x(k))$$

$$\Pr(X_t) = \prod_{k=0}^{t} \Pr(x(k)|x(k-1)) = \Pr(x(t)|x(t-1)) \prod_{k=0}^{t-1} \Pr(x(k)|x(k-1)) \tag{3.61}$$

where the $k = 0$ term, $\Pr(x(0)|x(-1))$, is understood as the prior, $\Pr(x(0))$.
Therefore, recognizing the relationship between these expansions as well as those at
$t-1$ and factoring the $t$-th term (as shown above), we can cancel both numerator and
denominator chains to extract the final recursions, that is,

$$W(t) = W(t-1) \times \frac{\Pr(y(t)|x(t)) \left[\prod_{k=0}^{t-1} \Pr(y(k)|x(k))\right]}{\left[\prod_{k=0}^{t-1} \Pr(y(k)|x(k))\right]} \times \frac{\Pr(x(t)|x(t-1)) \left[\prod_{k=0}^{t-1} \Pr(x(k)|x(k-1))\right]}{\left[\prod_{k=0}^{t-1} \Pr(x(k)|x(k-1))\right] q(x(t)|X_{t-1}, Y_t)} \tag{3.62}$$

which gives the final recursion

$$W(t) = W(t-1) \times \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1))}{q(x(t)|X_{t-1}, Y_t)} \tag{3.63}$$


Another way of developing this relationship is to recall the Bayesian solution to
the batch posterior estimation problem in Eq. 2.79. We have

$$\Pr(X_t|Y_t) = \left[\frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1))}{\Pr(y(t)|Y_{t-1})}\right] \times \Pr(X_{t-1}|Y_{t-1})$$

or recognizing the denominator as just the evidence or normalizing distribution and
not a function of $X_t$, we have

$$\Pr(X_t|Y_t) = C \times \Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1)) \times \Pr(X_{t-1}|Y_{t-1}) \tag{3.64}$$

or simply

$$\Pr(X_t|Y_t) \propto \Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1)) \times \Pr(X_{t-1}|Y_{t-1}) \tag{3.65}$$

Substituting this expression for the posterior in the weight relation as before, we have

$$W(t) \propto \frac{\Pr(X_t|Y_t)}{q(X_t|Y_t)} = \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1))}{q(x(t)|X_{t-1}, Y_t)} \times \underbrace{\frac{\Pr(X_{t-1}|Y_{t-1})}{q(X_{t-1}|Y_{t-1})}}_{\text{Previous Weight}}$$

giving the desired expression of Eq. 3.63.

These results enable us to formulate a generic Bayesian sequential importance
sampling algorithm:

1. Draw samples from the proposed importance distribution: $x_i(t) \to q(x(t)|X_{t-1}, Y_t)$;

2. Determine the required conditional distributions: $\Pr(x_i(t)|x(t-1))$, $\Pr(y(t)|x_i(t))$;

3. Calculate the unnormalized weights: $W_i(t)$ using Eq. 3.63 with $x(t) \to x_i(t)$;

4. Normalize the weights: $\mathcal{W}_i(t)$ in Eq. 3.54; and

5. Estimate the posterior distribution: $\hat{\Pr}(X_t|Y_t) = \sum_{i=1}^{N} \mathcal{W}_i(t)\,\delta(x(t) - x_i(t))$

Once the posterior is estimated, then the desired statistics evolve directly. We
summarize the generic sequential importance sampling in Table 3.2.


TABLE 3.2 Bayesian Sequential Importance Sampling Algorithm

Initialize
    $x_i(0) \sim q(x(0)|y(0))$; $i = 1, \ldots, N_p$    [sample prior]
    $W_i(0) = \frac{\Pr(y(0)|x_i(0)) \times \Pr(x_i(0))}{q(x_i(0)|y(0))}$    [weights]
    $\mathcal{W}_i(0) = \frac{W_i(0)}{\sum_{i=1}^{N_p} W_i(0)}$    [normalize]

Importance sampling
  Sample
    $x_i(t) \sim q(x(t)|X_{t-1}, Y_t)$; $i = 1, \ldots, N_p$    [sample]
  Weight Update
    $W_i(t) = W_i(t-1) \times \frac{\Pr(y(t)|x_i(t)) \times \Pr(x_i(t)|x_i(t-1))}{q(x_i(t)|X_i(t-1), Y_t)}$    [weights]
  Weight Normalization
    $\mathcal{W}_i(t) = \frac{W_i(t)}{\sum_{i=1}^{N_p} W_i(t)}$

Distribution
    $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\,\delta(x(t) - x_i(t))$    [posterior distribution]
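
The recursion of Table 3.2 can be sketched for a scalar linear Gaussian model, $x(t) = a\,x(t-1) + w(t-1)$, $y(t) = x(t) + v(t)$, with prior $x(0) \sim \mathcal{N}(0, 1)$. The model, its parameter values and the choice of the transition prior as the importance distribution are assumptions made for this illustration only. With that proposal the transition and proposal terms cancel in Eq. 3.63, leaving $W_i(t) = W_i(t-1) \times \Pr(y(t)|x_i(t))$.

```python
import math
import random


def gaussian_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


def sis_filter(y, a, r_ww, r_vv, n_particles, seed=0):
    """Sequential importance sampling with the transition prior as proposal.

    Returns the weighted-mean estimate of x(t) at each time step and the
    final normalized weights.
    """
    rng = random.Random(seed)
    # Initialize: sample the assumed prior x(0) ~ N(0, 1) and weight by the likelihood
    particles = [rng.gauss(0.0, 1.0) for _ in range(n_particles)]
    weights = normalize([gaussian_pdf(y[0], x, r_vv) for x in particles])
    estimates = [sum(w * x for w, x in zip(weights, particles))]
    noise_std = math.sqrt(r_ww)
    for t in range(1, len(y)):
        for i in range(n_particles):
            # Sample: x_i(t) ~ Pr(x(t) | x_i(t-1)), the proposal assumed here
            particles[i] = rng.gauss(a * particles[i], noise_std)
            # Weight update: transition and proposal cancel, the likelihood remains
            weights[i] *= gaussian_pdf(y[t], particles[i], r_vv)
        weights = normalize(weights)
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
    return estimates, weights


if __name__ == "__main__":
    measurements = [0.5, 0.9, 0.4, 1.1, 0.7]         # hypothetical data
    est, w = sis_filter(measurements, a=0.9, r_ww=0.1, r_vv=0.5,
                        n_particles=5000, seed=3)
    print("filtered estimates:", est)
```

Without any resampling step the weights concentrate on fewer and fewer samples as $t$ grows, so this sketch is intended only for short data records.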
Having introduced these ideas of Bayesian importance sampling, we are now ready to consider
applying this approach to a variety of models, which we discuss in the next chapter.


3.7 SUMMARY 

In this chapter we discussed the importance of simulation-based sampling methods
for nonlinear signal processing. Starting with the basics of PDF estimation and statistical
sampling theory, we motivated statistical approaches to sampling both when
we have analytical expressions for the distributions and when we do not and
must resort to pure sampling methodologies. We discussed the uniform sampling and
rejection sampling methods, examining their inherent advantages and disadvantages.
We showed how these approaches led to more sophisticated techniques evolving from
Markov chain theory and leading to the Metropolis-Hastings sampler. The Gibbs sampler,
a variant of the Metropolis-Hastings approach applicable in certain cases, was developed and
discussed along with its variant, the slice sampler. All of these methods fall into the
general class of iterative methods. Next we concentrated on the importance sampling
approach leading to its recursive version, the sequential importance sampler, which
is the workhorse of this text.


MATLAB NOTES 

MATLAB is a command-oriented vector-matrix package with a simple yet effective
command language featuring a wide variety of embedded C language constructs,
making it ideal for signal processing applications and graphics. MATLAB has a
Statistics Toolbox that incorporates a large suite of PDFs and CDFs as well as
“inverse” CDF functions ideal for simulation-based algorithms. The mhsample
command incorporates the Metropolis, Metropolis-Hastings and Metropolis independence
samplers in a single command while the Gibbs sampling approach is
adequately represented by the more efficient slice sampler (slicesample). There
are even specific “tools” for sampling as well as the inverse CDF method captured
in the randsample command. PDF estimators include the usual histogram (hist)
as well as the sophisticated kernel density estimator (ksdensity) offering a variety
of kernel (window) functions (Gaussian, etc.) and ICDF techniques. As yet
no sequential algorithms are available. Type help stats in MATLAB to get more
details or go to the MathWorks website.
PROBLEMS


3.1 Let x be a discrete random variable with probability mass function ( PMF) 


Px(x) = 




$x = 0, 1, 2, 3$


What is the PMF of $y$, $p_Y(y)$, when $y = x^2$?
3.2 Let $z$ be the distance from the origin to a random point selected within the unit
circle ($x^2 + y^2 < 1$). Let $z = x^2 + y^2$, then

(a) What is the probability of the selected point lying within the unit circle?

(b) What is the CDF of $z$? What is its PDF?

(c) Suppose $w = z^2$, what is $p_W(w)$? $P_W(w)$?

3.3 Let $x \sim \mathcal{U}(0, 1)$ and $y = -2\ln x$, what is $p_X(x)$?

3.4 Suppose we have a bivariate distribution with point $(X, Y)$ from a unit square
such that

P xr( x >y) =? iv 0 < x < 1, 0<y<l 
Let z = x + y, what is the CDF of z? PDF of z? 

3.5 A source emits particles that decay at a distance $x$ ($x \sim \mathcal{E}xp(\lambda)$) from the
source. The measurement instrument can only observe these events in a window
of length 20 cm ($x$ = 1 to 20 cm). $N$ decays are observed at locations:
$\{x_i\}$, $i = 1, \ldots, N$. Using Bayes’ rule [37]

(a) What is the characteristic length $\lambda$?

(b) Plot the distributions for $\{x_i\} = \{1.5, 2, 3, 4, 5, 12\}$.

3.6 For a particular television show, a contestant is given the following instruc¬ 
tions: 

• There are three doors labeled 1,2,3 with a prize hidden behind one of them. 
Select one of the doors, but it will NOT be opened. 

• The host will open one of the other two doors, but will NOT reveal the prize 
should it be there. 

• The contestant must now make a decision to keep his original choice or 
choose another door. 

• All the doors will then be opened and the prize revealed 
What should the contestant do? 

(a) Stay with the original choice? 

(b) Switch to the remaining door? 

(c) Does it make any difference? 

(Hint: Use Bayes’ rule to answer these questions) 

3.7 Suppose we would like to simulate a random variable $X$ such that
$\Pr(X = i) = \{0.2, 0.15, 0.25, 0.40\}$.

(a) Sketch out an algorithm using the inverse transform method to generate
realizations of $X$ choosing an ascending approach for $X = 1$, $X = 2$, $X = 3$
and $X = 4$.

(b) Sketch out an algorithm using the inverse transform method to generate
realizations of $X$ choosing a descending approach for $X = 4$, $X = 3$, $X = 1$
and $X = 2$.

(c) Which approach is more efficient? Why? 




3.8 We would like to simulate the value of a discrete random variable X with 

associated probabilities: Pr(X = i) = {0.11,0.12,0.09,0.08,0.12,0.10,0.09, 
0.09,0.10,0.10}. Using the rejection method with M = max for 

W(0,10) 

(a) Sketch out the rejection sampling algorithm for this problem.

(b) Using this approach, synthesize 1000 samples and estimate the histogram.
Estimate the kernel density. (Hint: MATLAB has the commands hist and
ksdensity to perform these operations).

(c) Does the estimated distribution appear to be any classical closed form (e.g. 
Poisson)? If so which one? 

3.9 Suppose the continuous random variable $X$ has cumulative distribution,
$P_X(x) = x^k$. Using the inverse transform approach, sketch the methodology to
synthesize $X$. How would you do this using the rejection approach? Generate
1000 samples and estimate the distributions.

3.10 Use the rejection sampling method to generate the continuous random variable
$x$ with density $p_X(x) = 20x(1 - x)^3$, $0 < x < 1$. (Hint: select $q(x) = 1$
and find the maximum ratio of the densities to obtain $M$).

3.11 We would like to compute the solution to the integral 

$$I = \int f(x)\,dx$$

Using the results of Ex. 3.14, develop the more general MC solution. 

3.12 Suppose we have a bivariate Gaussian distribution with unknown mean
$\mu = [\mu_1\ \mu_2]'$ and known covariance matrix



and a uniform prior on $\mu$. A single observation $(y_1, y_2)$ then has Gaussian
posterior:

$$\Pr(\mu|Y) \sim \mathcal{N}(Y, C) \quad \text{for} \quad Y = [y_1\ y_2]'$$

Sketch out the Gibbs sampler algorithm for this problem. 

3.13 Develop the Metropolis algorithm for a bivariate unit Gaussian target distribution,
$\mathcal{N}(\Theta : 0, I)$, with prior $\Pr(\Theta_0)$ (e.g. $\mathcal{U}(0, 1)$). The proposal distribution
is bivariate Gaussian also, $\mathcal{N}(\Theta^* : 0, (1/5)^2 I)$.

3.14 We would like to use MATLAB to simulate the Metropolis-Hastings sampler
(mhsample) for the following case with the target distribution being a standard
Gaussian, $\mathcal{N}(5, 1)$, and the proposal Rayleigh distributed with parameter,
$b = 1.5$. Use a 10% sample burn-in. Compare these results to the slice sampler
(slicesample). Use (ksdensity) to estimate the distribution of the resulting
samples for each algorithm for the comparison.

3.15 Develop the Metropolis-Hastings sampler for a 2 w/ -order autoregressive model 
(AR(2)) with known coefficients, {a\, <22} = {1, —0.5} driven by Gaussian noise 
with e~jV(0,1). 

(a) Develop the exact likelihood for the parameters, Pr(7|a, a 2 ). 

(b) What is the posterior distribution for the parameters, Pr(a, n 2 \Y)1 (Hint: 
Assume the prior is just an indicator function) 

(c) Develop the M-H-sampler for this problem. 

3.16 Suppose we would like to generate samples from a bivariate Gaussian with 
mean vector zero and covariance 



(a) Choose a uniform proposal, $\mathcal{U}(-3, 3)$, and develop the M-H sampler
algorithm with a 5% burn-in and N = 10,000 samples. (Hint: Use the MATLAB
mhsample command.)

(b) Using the MC approach, estimate the expected value: E{f(X)} = [1 1]X.

(c) Compare these results to those obtained using the slice sampler. (Hint:
Use the MATLAB slicesample command.)

3.17 Set up the Gibbs sampler (G-S) for a joint (X, Y) exponential distribution on 
an interval of length (0, /). Estimate the marginal distribution of X, Pr(X), and 
compare the results to a simulated data set based on N = 500 samples. 




4 


STATE-SPACE MODELS FOR 
BAYESIAN PROCESSING 


4.1 INTRODUCTION 

In this chapter we investigate the development of models for Bayesian estimation 
[1-15] using primarily the state-space representation—a versatile and robust model 
especially for random signals. We start with the definition of state and the basic 
principles underlying these characterizations and then show how they are incorporated
as propagation distributions for Bayesian processors in the following chapter.
We review the basics of state-space model development with all of their associated 
properties starting with the continuous-time processes, then sampled-data systems 
and finally proceeding to the discrete-time state-space. Next we develop the stochastic
version leading to Gauss-Markov representations when the models are driven
by white noise and then proceed to the nonlinear case [6-15]. Here we again drive 
the models with white Gaussian noise, but the results are not necessarily Gaussian. 
We develop linearization techniques based on Taylor-series expansions to arrive at 
linearized Gauss-Markov models. 

State-space models are easily generalized to multichannel, nonstationary, and nonlinear
processes. They are very popular for model-based signal processing primarily
because most physical phenomena modeled by mathematical relations naturally occur 
in state-space form (see [13] for details). With this motivation in mind, let us proceed 
to investigate the state-space representation in a more general form to at least “touch” 
on its inherent richness. We start with continuous-time systems and then proceed to 
the sampled-data followed by the discrete-time representation—the primary focus of 
this text. We begin by formally defining the concept of state [1]. 

4.2 CONTINUOUS-TIME STATE-SPACE MODELS

The state of a system at time t is the “minimum” set of variables (state variables)
along with the input sufficient to uniquely specify the dynamic system behavior for all
t over the interval $t \in [t_0, \infty)$. The state vector is the collection of state variables into a
single vector. The idea of a minimal set of state variables is critical and all techniques
to define them must ensure that the smallest number of “independent” states have
been defined in order to avoid possible violation of some important system theoretic
properties [2, 3].

Let us consider a general deterministic formulation of a nonlinear dynamic system
including the output (measurement) model in state-space form (continuous-time)¹

$$ \dot{x}_t = A(x_t, u_t) = a(x_t) + b(u_t) $$

$$ y_t = C(x_t, u_t) = c(x_t) + d(u_t) $$

for $x_t$, $y_t$ and $u_t$ the respective $N_x$-state, $N_y$-output and $N_u$-input vectors with
corresponding system (process), input, measurement (output) and feedthrough functions.
The $N_x$-dimensional system and input functions are defined by $a(\cdot)$ and $b(\cdot)$, while
the $N_y$-dimensional output and feedthrough functions are given by $c(\cdot)$ and $d(\cdot)$.

In order to specify the solution of the $N_x$-th order differential equations completely,
we must specify the above noted functions along with a set of $N_x$ initial conditions at
time $t_0$ and the input for all $t > t_0$. Here $N_x$ is the dimension of the “minimal” set of
state variables.

If we constrain the state-space representation to be linear in the states, then we
obtain the generic continuous-time, linear time-varying state-space model given by

$$ \dot{x}_t = A_t x_t + B_t u_t $$

$$ y_t = C_t x_t + D_t u_t \tag{4.1} $$

where $x_t \in \mathcal{R}^{N_x \times 1}$, $u_t \in \mathcal{R}^{N_u \times 1}$, $y_t \in \mathcal{R}^{N_y \times 1}$ and the respective system, input,
output and feedthrough matrices are: $A \in \mathcal{R}^{N_x \times N_x}$, $B \in \mathcal{R}^{N_x \times N_u}$, $C \in \mathcal{R}^{N_y \times N_x}$ and
$D \in \mathcal{R}^{N_y \times N_u}$.

The interesting property of the state-space representation is to realize that these 
models represent a complete generic form for almost any physical system. That is, if 
we have an RLC-circuit or a MCK-mechanical system, their dynamics are governed 
by the identical set of differential equations, but only their coefficients differ. Of 
course, the physical meaning of the states is different, but this is the idea behind
state-space: many physical systems can be captured by this generic form of differential
equations, even though the systems are physically different. Systems theory, which is
essentially the study of dynamic systems, is based on the study of state-space models
and is rich with theoretical results exposing the underlying properties of the dynamic
system under investigation. This is one of the major reasons why state-space models
are employed in signal processing, especially when the system is multivariable having
multiple inputs and multiple outputs. Next we develop the relationship between the
state-space representation and input-output relations, namely the transfer function.

¹ We separate $x_t$ and $u_t$ for future models, but it is not really necessary.

For this development we constrain the state-space representation above to be a
(deterministic) linear time-invariant (LTI) state-space model given by

$$ \dot{x}_t = A_c x_t + B_c u_t $$

$$ y_t = C_c x_t + D_c u_t \tag{4.2} $$

where $A_t \to A_c$, $B_t \to B_c$, $C_t \to C_c$ and $D_t \to D_c$ are their time-invariant counterparts,
with the subscript “c” annotating continuous-time matrices.

This LTI model corresponds to constant-coefficient differential equations,
which can be solved using Laplace transforms. Taking the Laplace transform of these
equations, we have that

$$ sX(s) - x_{t_0} = A_c X(s) + B_c U(s) $$

and solving for X(s)

$$ X(s) = (sI - A_c)^{-1} x_{t_0} + (sI - A_c)^{-1} B_c U(s) \tag{4.3} $$

where $I \in \mathcal{R}^{N_x \times N_x}$ is the identity matrix. The corresponding output is

$$ Y(s) = C_c X(s) + D_c U(s) \tag{4.4} $$

Therefore, combining these relations, we obtain

$$ Y(s) = [C_c (sI - A_c)^{-1} B_c + D_c] U(s) + C_c (sI - A_c)^{-1} x_{t_0} \tag{4.5} $$

From the definition of transfer function (zero initial conditions), we have the desired
result

$$ H(s) = C_c (sI - A_c)^{-1} B_c + D_c \tag{4.6} $$

Taking the inverse Laplace transform of this equation gives us the corresponding
impulse response matrix of the LTI system as [1]

$$ H(t, \tau) = C_c e^{A_c (t - \tau)} B_c + D_c \delta(t - \tau) \quad \text{for } t \geq \tau \tag{4.7} $$


So we see that the state-space representation enables us to express the input-output 
relations in terms of the internal variables or states. Note also that this is a multivariable 
representation as compared to the usual single input-single output (scalar) systems 
models that frequently appear in the signal processing literature. 
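
As a rough numerical illustration of Eq. 4.6, the sketch below evaluates the transfer function matrix at a single complex frequency with NumPy. The function name `transfer_function` and the second-order example matrices are assumptions made for illustration only; they do not come from the text.

```python
import numpy as np

def transfer_function(A, B, C, D, s):
    """Evaluate H(s) = C (sI - A)^{-1} B + D at one complex frequency s."""
    n = A.shape[0]
    # Solve (sI - A) X = B instead of forming the inverse explicitly.
    X = np.linalg.solve(s * np.eye(n) - A, B)
    return C @ X + D

# Hypothetical example: x'' + 3x' + 2x = u, y = x, written in companion form,
# so that H(s) = 1 / (s^2 + 3s + 2).
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
D = np.array([[0.0]])

H_dc = transfer_function(A, B, C, D, 0.0)    # gain at s = 0
H_j1 = transfer_function(A, B, C, D, 1.0j)   # gain at s = j
```

Solving the linear system rather than inverting $(sI - A_c)$ is a common numerical preference; for a multivariable system the returned array has one entry per input-output pair.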

Now that we have the multivariable transfer function representation of our LTI
system, we can solve the state equations directly using inverse transforms to obtain
the time-domain solutions. First we simplify the notation by defining the Laplace
transform of the state transition matrix or the so-called resolvent matrix of systems
theory [1, 3] as

$$ \Phi_c(s) := (sI - A_c)^{-1} \tag{4.8} $$

Therefore we can rewrite the transfer function matrix as

$$ H(s) = C_c \Phi_c(s) B_c + D_c \tag{4.9} $$

and the corresponding state-input transfer matrix by

$$ X(s) = \Phi_c(s) x_{t_0} + \Phi_c(s) B_c U(s) \tag{4.10} $$

Taking the inverse Laplace transformation gives the time domain solution

$$ x_t = \mathcal{L}^{-1}[X(s)] = \Phi_c(t, t_0) x_{t_0} + \Phi_c(t, t_0) B_c * u_t $$

or

$$ x_t = \underbrace{\Phi_c(t, t_0) x_{t_0}}_{\text{zero-input}} + \underbrace{\int_{t_0}^{t} \Phi_c(t, \alpha) B_c u_\alpha \, d\alpha}_{\text{zero-state}} \tag{4.11} $$

with corresponding output solution

$$ y_t = C_c \Phi_c(t, t_0) x_{t_0} + \int_{t_0}^{t} C_c \Phi_c(t, \alpha) B_c u_\alpha \, d\alpha \tag{4.12} $$


The state transition matrix, $\Phi_c(t, t_0)$, is the critical component in the solution of the
state equations. Ignoring the input (zero-state) of the LTI state-space system, we have
the set of (homogeneous) vector-matrix state equations

$$ \dot{x}_t = A_c x_t \tag{4.13} $$

It is well known from linear algebra [1] that this equation has the matrix exponential
as its solution

$$ x_t = \Phi_c(t, t_0) x_{t_0} = e^{A_c (t - t_0)} x_{t_0} \tag{4.14} $$

The meaning of “transition” is now clear since knowledge of the transition matrix
$\Phi_c(t, t_0)$ enables us to calculate the transition of the state vector from time $t_0$ to any
$t > t_0$. Taking the Laplace transform of this equation gives

$$ X(s) = \Phi_c(s) x_{t_0} = (sI - A_c)^{-1} x_{t_0} $$

with

$$ e^{A_c t} = \mathcal{L}^{-1}[(sI - A_c)^{-1}] \tag{4.15} $$

We have that the state transition matrix for an LTI system is

$$ \Phi_c(t, t_0) = e^{A_c (t - t_0)}, \quad t \geq t_0 \tag{4.16} $$

Revisiting the continuous-time system of Eq. 4.11 and substituting the matrix
exponential for the state transition matrix gives the LTI solution as

$$ x_t = e^{A_c (t - t_0)} x_{t_0} + \int_{t_0}^{t} e^{A_c (t - \alpha)} B_c u_\alpha \, d\alpha \tag{4.17} $$

with corresponding measurement system

$$ y_t = C_c x_t \tag{4.18} $$

In general, the state transition matrix satisfies the following properties [1, 2]:

1. $\Phi_c(t, t_0)$ is uniquely defined for $t, t_0 \in [0, \infty)$ [Unique]

2. $\Phi_c(t, t) = I$ [Identity]

3. $\Phi_c(t, t_0)$ satisfies the matrix differential equation:

$$ \dot{\Phi}_c(t, t_0) = A_t \Phi_c(t, t_0), \quad \Phi_c(t_0, t_0) = I, \quad t \geq t_0 \tag{4.19} $$

4. $\Phi_c(t, t_0) = \Phi_c(t, \tau) \times \Phi_c(\tau, \alpha) \times \cdots \times \Phi_c(\beta, t_0)$ [Semi-Group]

5. $\Phi_c(t, \tau)^{-1} = \Phi_c(\tau, t)$ [Inverse]

Thus, the transition matrix plays a pivotal role in LTI systems theory for the analysis
and prediction of the response of linear time-invariant and time-varying systems [2].
For instance, the poles of an LTI system govern such important properties as stability and
response time. The poles are the roots of the characteristic (polynomial) equation of
$A_c$, which is found by solving for the roots of the determinant of the inverse of the
resolvent, that is,

$$ |\Phi_c^{-1}(s)| = |(sI - A_c)|_{s = p_i} = 0 \tag{4.20} $$

Stability is determined by assuring that all of the poles lie within the left half of the
s-plane. The poles of the system determine its response as:

$$ y_t = \sum_{i=1}^{N_x} K_i e^{-p_i t} \tag{4.21} $$

where $A_c$ is diagonal, which in this case is given by $A_c = \text{diag}[p_1, p_2, \ldots, p_{N_x}]$, the
eigenvalues of $A_c$. Next we consider the sampled-data, state-space representation.
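
The pole and stability statements above can be checked numerically. The following is a minimal sketch, assuming the poles are obtained as the eigenvalues of $A_c$ via NumPy; the helper name `poles_and_stability` and the example matrix are hypothetical.

```python
import numpy as np

def poles_and_stability(A):
    """Return the poles (eigenvalues of A) and whether all lie in the open left half-plane."""
    poles = np.linalg.eigvals(A)
    return poles, bool(np.all(poles.real < 0.0))

# Hypothetical system matrix with characteristic polynomial s^2 + 3s + 2.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
poles, stable = poles_and_stability(A)
```

For this example the characteristic polynomial factors as $(s + 1)(s + 2)$, so both poles lie in the left half-plane.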

4.3 SAMPLED-DATA STATE-SPACE MODELS

Sampling a continuous-time system is commonplace with the advent of high speed
analog-to-digital converters (ADC) and modern computers. A sampled-data system
lies somewhere between the continuous analog domain (physical system) and
the purely discrete domain (stock market prices). Since we are strictly sampling a
continuous-time process, we must assure that all of its properties are preserved. The
well-known Nyquist sampling theorem precisely expresses the required conditions to
achieve “perfect” reconstruction of the process from its samples [13].

Thus, if we have a physical system governed by continuous-time dynamics and we
“sample” it at given time instants, then a sampled-data model can be obtained directly
from the solution of the continuous-time state-space model. That is, we know from
the previous section that

$$ x_t = \Phi_c(t, t_0) x_{t_0} + \int_{t_0}^{t} \Phi_c(t, \alpha) B_c(\alpha) u_\alpha \, d\alpha $$

where $\Phi_c(\cdot, \cdot)$ is the continuous-time state transition matrix that satisfies the matrix
differential equation

$$ \dot{\Phi}_c(t, t_0) = A_t \Phi_c(t, t_0), \quad \Phi_c(t_0, t_0) = I, \quad t \geq t_0 $$

Sampling this system such that $t \to t_k$ over the interval $(t_k, t_{k-1}]$, we have the
corresponding sampling interval defined by $\Delta t_k := t_k - t_{k-1}$. Note that this representation
need not necessarily be equally spaced, which is another important property of the
state-space representation. Thus the sampled solution becomes (with notation change)

$$ x(t_k) = \Phi(t_k, t_{k-1}) x(t_{k-1}) + \int_{t_{k-1}}^{t_k} \Phi(t_k, \alpha) B_c(\alpha) u_\alpha \, d\alpha \tag{4.22} $$

and therefore from the differential equation above, we have the solution

$$ \Phi(t_k, t_{k-1}) = \int_{t_{k-1}}^{t_k} A(\alpha) \Phi(t_k, \alpha) \, d\alpha \quad \text{for } \Phi(t_0, t_0) = I \tag{4.23} $$

where $\Phi(t_k, t_{k-1})$ is the sampled-data state transition matrix, the critical component
in the solution of the state equations enabling us to calculate the state evolution in
time.

If we further assume that the input excitation is piecewise constant ($u_\alpha \to u(t_{k-1})$)
over the interval $(t_k, t_{k-1}]$, then it can be removed from under the superposition integral
in Eq. 4.22 to give

$$ x(t_k) = \Phi(t_k, t_{k-1}) x(t_{k-1}) + \left( \int_{t_{k-1}}^{t_k} \Phi(t_k, \alpha) B_c(\alpha) \, d\alpha \right) \times u(t_{k-1}) \tag{4.24} $$

Under this assumption, we can define the sampled input transmission matrix as

$$ B(t_{k-1}) := \int_{t_{k-1}}^{t_k} \Phi(t_k, \alpha) B_c(\alpha) \, d\alpha \tag{4.25} $$

and therefore the sampled-data state-space system with equally or unequally sampled
data is given by:

$$ x(t_k) = \Phi(t_k, t_{k-1}) x(t_{k-1}) + B(t_{k-1}) u(t_{k-1}) $$

$$ y(t_k) = C(t_k) x(t_k) \tag{4.26} $$

Computationally, sampled-data systems pose no particular problems when care is 
taken, especially since reasonable approximation and numerical integration methods 
exist [17]. For instance, the following three methods of solving for the state transition 
and input transmission matrices can be quite effective. If we constrain the system to 
be LTI, then we have that the matrix exponential can be represented by the Taylor 
series expansion 

$$ e^{A_c \Delta t_k} = \sum_{i=0}^{\infty} \frac{(A_c \Delta t_k)^i}{i!} \tag{4.27} $$

Truncating the series is possible at an acceptable error magnitude [17]. This is called
the series approach to estimating the state transition matrix and can be determined
for a finite sum. For example, a simple first-order approximation uses the relations:

$$ \Phi(t_k, t_{k-1}) \approx (I + \Delta t_k A_c) $$

$$ B(t_k) \approx \Delta t_k B_c \tag{4.28} $$

This direct approach can yield unsatisfactory results; however, one improved
solution is based on the Padé approximation incorporating a scaling and squaring
technique [17, 18]. This scaling and squaring property of the matrix exponential is
given by

$$ e^{A_c \Delta t_k} = \left( e^{(A_c \Delta t_k)/m} \right)^m \tag{4.29} $$

and is based on choosing the integer m to be a power-of-two such that the exponential
term can reliably and efficiently be calculated followed by repeated squaring. A typical
criterion is to choose $\|A\|/m \ll 1$, yielding a very effective numerical technique for
either Taylor or Padé approximants.
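
The series and scaling-and-squaring ideas of Eqs. 4.27 to 4.29 can be sketched in a few lines of NumPy. This is an illustrative sketch rather than a production routine: the function names, the default number of terms, the specific rule $\|M\|_1 / m \le 1/2$ for picking m, and the example matrix are all assumptions, and a Taylor series is used where a Padé approximant would normally be preferred.

```python
import numpy as np

def expm_taylor(M, n_terms=25):
    """Truncated Taylor series: sum of M^i / i! for i = 0, ..., n_terms - 1 (Eq. 4.27)."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for i in range(1, n_terms):
        term = term @ M / i              # builds M^i / i! recursively
        result = result + term
    return result

def expm_scaling_squaring(M, n_terms=10):
    """e^M = (e^{M/m})^m with m = 2^j (Eq. 4.29); j is chosen so that ||M||_1 / m <= 1/2."""
    norm = np.linalg.norm(M, 1)
    j = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    E = expm_taylor(M / 2**j, n_terms)   # exponential of the scaled matrix
    for _ in range(j):
        E = E @ E                        # repeated squaring undoes the scaling
    return E

# Hypothetical continuous-time system matrix and sampling interval.
A_c = np.array([[0.0, 1.0], [-2.0, -3.0]])
dt = 0.1
Phi_first = np.eye(2) + dt * A_c         # first-order approximation, Eq. 4.28
Phi_series = expm_taylor(A_c * dt)
Phi_ss = expm_scaling_squaring(A_c * dt)
```

For small $\|A_c \Delta t\|$ the three results agree closely except for the first-order one, whose error is of order $\Delta t^2$; scaling and squaring matters mainly when the norm is large and a direct series would need many terms.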

Ordinary differential equation methods using numerical integration techniques 
(e.g., Runge-Kutta, Gear’s method, etc.) offer another practical approach to solving 
for the state transition matrix and the corresponding input matrices as given by Eqs. 
4.23 and 4.25, respectively. The advantages of numerical integration techniques are 
reliability and accuracy as well as applicability to time-varying and nonlinear systems.
The major disadvantage is computational time which can be very high for stiff
(large eigenvalue spread) differential equations and variable integration step-sizes
[17]. In any case, the sampled-data system has the property that it has evolved
from a system with continuous dynamics and must be accurately approximated or 
numerically integrated to produce reliable solutions. 

The final class of methods we discuss briefly are the matrix decomposition methods
[18]. These methods are based on similarity transformations, that is,

$$ A_c = T \Lambda_c T^{-1} \quad \text{and} \quad e^{A_c \Delta t} = T e^{\Lambda_c \Delta t} T^{-1} \tag{4.30} $$

where $\Delta t := t - \tau$ in the continuous-time case. If the similarity transformation matrix
is chosen to be an eigenvalue-eigenvector transformation, say $T = V$, then

$$ e^{A_c \Delta t} = V e^{\Lambda_c \Delta t} V^{-1} \tag{4.31} $$

with $\Lambda_c$ diagonal. The matrix exponential operation becomes a simple scalar
computation,

$$ e^{\Lambda_c \Delta t} = \text{diag}(e^{\lambda_1 \Delta t}, \ldots, e^{\lambda_N \Delta t}) \tag{4.32} $$

In fact, using the eigen-decomposition and applying the ordinary differential
equation approach, we have that

$$ \dot{x}(t) = A_c x(t) $$

and therefore the solution is given in terms of the eigenvectors, $v_i$,

$$ x(t) = \sum_{i=1}^{N} \alpha_i e^{\lambda_i t} v_i \tag{4.33} $$

where the coefficients, $\alpha_i$, are the solution of the linear equations, $V \times \alpha = x(0)$. This
approach works well when $A_c$ is symmetric, leading to an orthogonal set of eigenvectors,
but can be plagued with a wealth of numerical issues that need reconciliation
and sophisticated numerical techniques [18].
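
The decomposition approach of Eqs. 4.31 to 4.33 can be sketched as follows, assuming $A_c$ is diagonalizable with a well-conditioned eigenvector matrix. The function names and the example matrix are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def expm_eig(A, dt):
    """e^{A dt} = V diag(e^{lambda_i dt}) V^{-1} (Eqs. 4.31-4.32); assumes A is diagonalizable."""
    lam, V = np.linalg.eig(A)
    # Broadcasting scales column i of V by e^{lambda_i dt}, i.e. V @ diag(...).
    return (V * np.exp(lam * dt)) @ np.linalg.inv(V)

def modal_solution(A, x0, t):
    """x(t) = sum_i alpha_i e^{lambda_i t} v_i with V alpha = x(0) (Eq. 4.33)."""
    lam, V = np.linalg.eig(A)
    alpha = np.linalg.solve(V, x0)
    return V @ (alpha * np.exp(lam * t))

# Hypothetical system matrix with distinct real eigenvalues -1 and -2.
A_c = np.array([[0.0, 1.0], [-2.0, -3.0]])
x0 = np.array([2.0, 0.0])
Phi = expm_eig(A_c, 0.1)
x_t = modal_solution(A_c, x0, 0.1)
```

When the eigenvalues are complex the intermediate quantities are complex even though the final result is real up to rounding; a nearly defective $A_c$ makes the inverse of V ill-conditioned, which is one of the numerical issues the text alludes to.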

Consider the following example to demonstrate these approaches. 

Example 4.1 

Suppose we are given the following system: 


$$ \dot{x}_t = -0.303 x_t + u_t $$

$$ y_t = 2 x_t $$

with sampling interval, $\Delta t = 0.1$, initial state, $x_{t_0} = 2$, and input $u_t$ a sequence of
irregularly-spaced step functions. Using a first-order approximation, we obtain

$$ x(t_k) = (1 - 0.303 \Delta t_k) x(t_{k-1}) + \Delta t_k u(t_{k-1}) = 0.97 x(t_{k-1}) + 0.1 u(t_{k-1}) $$

$$ y(t_k) = 2 x(t_k) $$


We performed numerical integration on the differential equations and a Taylor
series approximation using 25 terms to achieve an error tolerance of $\epsilon = 10^{-12}$.
The simulations of the state, output and input are shown in Fig. 4.1a-c. The true
continuous-time solution is shown in the figure as the solid line, while the
sampled-data (discrete) solution is shown with the dotted lines. The plots reasonably overlay
one another, but it is apparent that there is some truncation error evolving which can
be observed at the peak amplitudes of the simulation. Thus we see that the
continuous-time system in some cases can be reasonably approximated even by the Taylor-series
approach. △△△

FIGURE 4.1 Sampled-data simulation (unequally spaced) of first order continuous
process: (a) States: continuous (numerical integration) and sampled (series approximation).
(b) Outputs: continuous (numerical integration) and sampled (series approximation).
(c) Input: continuous (numerical integration). Horizontal axis: Time (sec).
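
To get a feel for the size of the truncation error in Example 4.1, the sketch below compares the first-order model with the exact sampled-data model of the same scalar system. It assumes equally spaced samples and a made-up piecewise-constant input, since the step pattern used in the text is not given; the helper `simulate` and the variable names are likewise hypothetical.

```python
import math

a, dt, x0 = -0.303, 0.1, 2.0           # values taken from Example 4.1

phi_first = 1.0 + a * dt                # first-order transition: 0.9697 (0.97 in the text)
b_first = dt                            # first-order input gain
phi_exact = math.exp(a * dt)            # exact transition for a scalar LTI system
b_exact = (phi_exact - 1.0) / a         # integral of e^{a(dt - alpha)} over one interval

def simulate(phi, b, u_seq, x_init):
    """Propagate x(t_k) = phi x(t_{k-1}) + b u(t_{k-1}); return states and outputs y = 2x."""
    xs = [x_init]
    for u in u_seq:
        xs.append(phi * xs[-1] + b * u)
    return xs, [2.0 * x for x in xs]

# Hypothetical input: off, on, off (the actual input of the example is not specified).
u_seq = [0.0] * 50 + [1.0] * 50 + [0.0] * 50
x_first, y_first = simulate(phi_first, b_first, u_seq, x0)
x_exact, y_exact = simulate(phi_exact, b_exact, u_seq, x0)
max_gap = max(abs(p - q) for p, q in zip(x_first, x_exact))
```

Under these assumptions the two state trajectories share the same steady-state values and differ only during transients, which is consistent with the small discrepancies the example reports near the peaks.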

This completes the discussion of sampled-data systems and approximations. Next
we consider discrete-time systems.


4.4 DISCRETE-TIME STATE-SPACE MODELS 

Discrete state-space models evolve in two distinct ways: naturally from the problem 
or from sampling a continuous-time dynamical system. An example of a natural 
discrete system is the dynamics of balancing our own checkbook. Here the state is the 
evolving balance given the past balance and the amount of the previous check. There 
is “no information” between time samples and so this model represents a
discrete-time system that evolves naturally from the underlying problem. On the other hand, if
we have a physical system governed by continuous-time dynamics, then we “sample”
it at given time instants as discussed in the previous section. So we see that
discrete-time dynamical systems can evolve from a wide variety of problems both naturally
(checkbook) or physically (circuit). In this text we are primarily interested in physical 
systems (physics-based models), so we will concentrate on sampled systems reducing 
them to a discrete-time state-space model. 

We can use a first-difference approximation² and apply it to the general LTI
continuous-time state-space model to obtain a discrete-time system, that is,

$$ \dot{x}(t) \approx \frac{x(t) - x(t-1)}{\Delta T} = A_c x(t-1) + B_c u(t-1) $$

$$ y(t) = C_c x(t) + D_c u(t) $$

Solving for x(t), we obtain

$$ x(t) = (I + A_c \Delta T) x(t-1) + B_c \Delta T u(t-1) $$

$$ y(t) = C_c x(t) + D_c u(t) \tag{4.34} $$

Recognizing that the first difference approximation is equivalent to a first order
Taylor series approximation of $e^{A_c \Delta T}$ gives the discrete system, input, output and
feedthrough matrices as

$$ A \approx I + A_c \Delta T + O(\Delta T^2) $$

$$ B \approx B_c \Delta T $$

$$ C \approx C_c $$

$$ D \approx D_c \tag{4.35} $$

² This approximation is equivalent to a first-order Taylor series approximation.

We define the nonlinear discrete-time state-space representation by its process or
system model

$$ x(t) = A(x(t-1), u(t-1)) = a[x(t-1)] + b[u(t-1)] \tag{4.36} $$

and corresponding measurement or output model by

$$ y(t) = C(x(t), u(t)) = c[x(t)] + d[u(t)] \tag{4.37} $$

where x(t), u(t), y(t) are the respective discrete-time, $N_x$-state, $N_u$-input and $N_y$-output
vectors with corresponding system (process), input, output and feedthrough functions:
the $N_x$-dimensional system and input functions, $a[\cdot]$, $b[\cdot]$, and the $N_y$-dimensional
output and feedthrough functions, $c[\cdot]$, $d[\cdot]$.

The discrete linear time-varying state-space representation is given by the system
or process model as

$$ x(t) = A(t-1) x(t-1) + B(t-1) u(t-1) \tag{4.38} $$

and the corresponding discrete output or measurement model as

$$ y(t) = C(t) x(t) + D(t) u(t) \tag{4.39} $$

where x, u, y are the respective $N_x$-state, $N_u$-input, $N_y$-output vectors and A, B, C, D are
the $(N_x \times N_x)$-system, $(N_x \times N_u)$-input, $(N_y \times N_x)$-output and $(N_y \times N_u)$-feedthrough
matrices.

The state-space representation for linear, time-invariant, discrete systems is
characterized by constant system, input, output and feedthrough matrices, that is,

$$ A(t) = A, \quad B(t) = B, \quad C(t) = C, \quad D(t) = D $$

and is given by the LTI system

$$ x(t) = A x(t-1) + B u(t-1) $$

$$ y(t) = C x(t) + D u(t) \tag{4.40} $$

The discrete system representation replaces the Laplace transform with the
Z-transform defined by the transform pair:

$$ X(z) := \sum_{t=0}^{\infty} x(t) z^{-t}, \qquad x(t) = \frac{1}{2\pi j} \oint X(z) z^{t-1} \, dz \tag{4.41} $$

Time-invariant state-space discrete systems can also be represented in input-output
or transfer function form using the Z-transform to give

$$ H(z) = C(zI - A)^{-1} B + D \tag{4.42} $$

Also taking inverse Z-transforms, we obtain the discrete impulse response
matrix as

$$ H(t, k) = C A^{t-k} B + D \quad \text{for } t > k \tag{4.43} $$

The solution to the state-difference equations can easily be derived by induction
[3] or using the transfer function approach of the previous subsection. In any case it
is given by the relations

$$ x(t) = \Phi(t, k) x(k) + \sum_{i=k}^{t-1} \Phi(t, i) B(i) u(i) \quad \text{for } t > k \tag{4.44} $$

where $\Phi(t, k)$ is the discrete-time state-transition matrix. For time-varying systems,
it can be shown (by induction) that the state-transition matrix³ satisfies

$$ \Phi(t, k) = A(t-1) \cdot A(t-2) \cdots A(k) $$

while for time-invariant systems the state-transition matrix is given by

$$ \Phi(t, k) = A^{t-k} \quad \text{for } t > k $$

The discrete state transition matrix possesses properties analogous to its
continuous-time counterpart, that is,

1. $\Phi(t, k)$ is uniquely defined [Unique]

2. $\Phi(t, t) = I$ [Identity]

3. $\Phi(t, k)$ satisfies the matrix difference equation:

$$ \Phi(t, k) = A(t-1) \Phi(t-1, k), \quad \Phi(k, k) = I, \quad t \geq k+1 \tag{4.45} $$

4. $\Phi(t, k) = \Phi(t, t-1) \times \Phi(t-1, t-2) \times \cdots \times \Phi(k+1, k)$ [Semi-Group]

5. $\Phi^{-1}(t, k) = \Phi(k, t)$ [Inverse]

³ ... transition matrix is: $\Phi(t, t_k) = \ldots$ where ...

As in the continuous case, the discrete-time transition matrix has the same 
importance in discrete systems theory which we discuss next. 
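
The product form of the discrete transition matrix and its semi-group property can be illustrated with a short NumPy sketch. The function `transition_matrix` and the time-varying matrix `A_of` are hypothetical examples, not taken from the text.

```python
import numpy as np

def transition_matrix(A_of, t, k):
    """Phi(t, k) = A(t-1) A(t-2) ... A(k) for t >= k, with Phi(k, k) = I."""
    n = A_of(k).shape[0]
    Phi = np.eye(n)
    for i in range(k, t):
        Phi = A_of(i) @ Phi              # Phi(i+1, k) = A(i) Phi(i, k), as in Eq. 4.45
    return Phi

# Hypothetical time-varying system matrix, for illustration only.
def A_of(t):
    return np.array([[1.0, 0.1 * (t + 1)], [0.0, 0.9]])

Phi_50 = transition_matrix(A_of, 5, 0)
Phi_53 = transition_matrix(A_of, 5, 3)
Phi_30 = transition_matrix(A_of, 3, 0)
semi_group_holds = bool(np.allclose(Phi_50, Phi_53 @ Phi_30))
```

Passing a function that always returns the same matrix reduces the product to the time-invariant case, a power of A.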


4.4.1 Discrete Systems Theory 

In this subsection we investigate the discrete state-space model from a systems theory 
viewpoint. There are certain properties that a dynamic system must possess in order to 
assure a consistent representation of the dynamics under investigation. For instance, 
it can be shown [2] that a necessary requirement of a measurement system is that it 
is observable, that is, measurements of available variables or parameters of interest 
provide enough information to reconstruct the internal variables or states.
Mathematically, a system is said to be completely observable, if for any initial state, say x(0),
in the state-space, there exists a finite t > 0 such that knowledge of the input u(t)
and the output y(t) is sufficient to specify x(0) uniquely. Recall that the deterministic
linear state-space representation of a discrete system is defined by the following set
of equations:

State Model: $x(t) = A(t-1) x(t-1) + B(t-1) u(t-1)$

with corresponding measurement system or output defined by

Measurement Model: $y(t) = C(t) x(t)$

Using this representation, the simplest example of an observable system is one in
which each state is measured directly, and therefore the measurement matrix C is an
$N_x \times N_x$ matrix. Thus, from the measurement system model, we have that in order to
reconstruct x(t) from its measurements y(t), C must be invertible. In this case the
system is said to be completely observable; however, if C is not invertible, then the
system is said to be unobservable. The next level of complexity involves the solution
to this same problem when C is an $N_y \times N_x$ matrix; then a pseudo-inverse must be
performed instead [1, 2]. In the general case the solution gets more involved because
we are not just interested in reconstructing x(t), but x(t) over all finite values of t;
therefore, we must include the state model, that is, the dynamics as well.

With this motivation in mind, we now formally define the concept of observability.
The solution to the state representation is governed by the state-transition matrix,
$\Phi(t, 0)$, where recall that the state equation is [3]

$$ x(t) = \Phi(t, 0) x(0) + \sum_{k=0}^{t-1} \Phi(t, k) B(k) u(k) $$

Therefore, pre-multiplying by the measurement matrix, the output relations are

$$ y(t) = C(t) \Phi(t, 0) x(0) + \sum_{k=0}^{t-1} C(t) \Phi(t, k) B(k) u(k) \tag{4.46} $$

or rearranging we define

$$ \tilde{y}(t) := y(t) - \sum_{k=0}^{t-1} C(t) \Phi(t, k) B(k) u(k) = C(t) \Phi(t, 0) x(0) \tag{4.47} $$

The problem is to solve this resulting equation for the initial state; therefore,
multiplying both sides by $\Phi'(t, 0) C'(t)$, we can infer the solution from the relation

$$ \Phi'(t, 0) C'(t) C(t) \Phi(t, 0) x(0) = \Phi'(t, 0) C'(t) \tilde{y}(t) $$

Thus, the observability question now becomes: under what conditions can this equation
be uniquely solved for x(0)? Equivalently, we are asking if the null space of $C(t) \Phi(t, 0)$
is $0 \in \mathcal{R}^{N_x \times 1}$. It has been shown [2, 4] that the following $N_x \times N_x$ observability
Gramian has the identical null space, that is,

$$ \mathcal{O}(0, t) := \sum_{k=0}^{t-1} \Phi'(k, 0) C'(k) C(k) \Phi(k, 0) \tag{4.48} $$

which is equivalent to determining that $\mathcal{O}(0, t)$ is nonsingular or rank $N_x$. Further
assuming that the system is LTI leads to the $N N_y \times N_x$ observability matrix [4]
given by

$$ \mathcal{O}(N) := \begin{bmatrix} C \\ \vdots \\ CA^{N-1} \end{bmatrix} \tag{4.49} $$

It can be shown that a necessary and sufficient condition for a system to be
completely observable is that the rank of $\mathcal{O}$ or $\rho[\mathcal{O}(N)]$ must be $N_x$. Thus, for the LTI case,
checking that all of the measurements contain the essential information to reconstruct
the states for a linear time-invariant system reduces to that of checking the rank of the
observability matrix. Although this is a useful mathematical concept, it is primarily
used as a rule-of-thumb in the analysis of complicated measurement systems.
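
The rank test on the observability matrix of Eq. 4.49 is straightforward to sketch numerically. In the sketch below the number of block rows is taken as $N = N_x$, and the two-state example (a position/velocity model observed through either state) is a hypothetical illustration.

```python
import numpy as np

def observability_matrix(A, C):
    """Stack C, CA, ..., CA^{N-1} with N taken as the number of states N_x."""
    n = A.shape[0]
    blocks = [C]
    for _ in range(n - 1):
        blocks.append(blocks[-1] @ A)    # next block row: previous row times A
    return np.vstack(blocks)

def is_observable(A, C):
    """Rank condition: rho[O(N)] = N_x."""
    return bool(np.linalg.matrix_rank(observability_matrix(A, C)) == A.shape[0])

# Hypothetical two-state model: x1 accumulates x2.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
obs_from_x1 = is_observable(A, np.array([[1.0, 0.0]]))   # measure x1 only
obs_from_x2 = is_observable(A, np.array([[0.0, 1.0]]))   # measure x2 only
```

Measuring only the first state lets the second be inferred from successive differences, whereas measuring only the second state carries no information about the first, so the rank drops to one.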

Analogous to the system theoretic property of observability is that of controllability,
which is concerned with the effect of the input on the states of the dynamic system.
A discrete system is said to be completely controllable if for any $x(t), x(0) \in \mathcal{R}^{N_x}$ there
exists an input sequence, $\{u(t)\}$, $t = 0, \ldots, N-1$, such that the solution to the state
equations with initial condition x(0) is x(t) for some finite t. Following the same
approach as for observability, we obtain the condition that the controllability Gramian defined by

$$ \mathcal{C}(0, t) := \sum_{k=0}^{t-1} \Phi(0, k) B(k) B'(k) \Phi'(0, k) \tag{4.50} $$

is nonsingular or $\rho[\mathcal{C}(0, t)] = N_x$.

Again for the LTI system, the $N_x \times N N_u$ controllability matrix defined by

$$ \mathcal{C}(N) := [B \,|\, AB \,|\, \cdots \,|\, A^{N-1} B] \tag{4.51} $$

must satisfy the rank condition, $\rho[\mathcal{C}] = N_x$, to be completely controllable [4].
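
The controllability rank condition of Eq. 4.51 can be sketched in the same way as the observability test. Again $N = N_x$ is assumed, and the matrices are hypothetical examples.

```python
import numpy as np

def controllability_matrix(A, B):
    """Concatenate B, AB, ..., A^{N-1}B with N taken as the number of states N_x."""
    n = A.shape[0]
    blocks = [B]
    for _ in range(n - 1):
        blocks.append(A @ blocks[-1])    # next block column: A times previous column
    return np.hstack(blocks)

def is_controllable(A, B):
    """Rank condition: rho[C(N)] = N_x."""
    return bool(np.linalg.matrix_rank(controllability_matrix(A, B)) == A.shape[0])

# Hypothetical two-state model: x1 accumulates x2.
A = np.array([[1.0, 1.0], [0.0, 1.0]])
ctrl_via_x2 = is_controllable(A, np.array([[0.0], [1.0]]))   # input drives x2
ctrl_via_x1 = is_controllable(A, np.array([[1.0], [0.0]]))   # input drives x1 only
```

Driving the second state reaches the first through the dynamics, while an input entering only the first state can never influence the second.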

If we continue with the LTI system description, we know from Z-transform theory
that the discrete transfer function can be represented by an infinite power series, that is,

$$ H(z) = C(zI - A)^{-1} B = \sum_{k=1}^{\infty} H(k) z^{-k} \quad \text{for } H(k) = C A^{k-1} B \tag{4.52} $$

where H(k) is the $N_y \times N_u$ unit impulse response matrix which may also be viewed
as a Markov sequence with (A, B, C) defined as the Markov parameters.

The problem of determining the internal description (A, B, C) from the external
description (H(z) or {H(k)}) of Eq. 4.43 is called the realization problem. Out of all
possible realizations, (A, B, C), having the same Markov parameters, those of smallest
dimension are defined as minimal realizations. It will subsequently be shown that
the dimension of the minimal realization is identical to the degree of the characteristic
polynomial (actually the minimal polynomial for multivariable systems) or
equivalently the degree of the transfer function (number of system poles).

In order to develop these relations, we define the $(N \times N_y N_u) \times (N \times N_y N_u)$ Hankel
matrix by

$$ \mathcal{H}(N) := \begin{bmatrix} H(1) & H(2) & \cdots & H(N) \\ H(2) & H(3) & \cdots & H(N+1) \\ \vdots & \vdots & \ddots & \vdots \\ H(N) & H(N+1) & \cdots & H(2N-1) \end{bmatrix} \tag{4.53} $$


Suppose the dimension of the system is $N_x$. Then the Hankel matrix could be
constructed such that $N = N_x$ using $2N_x$ impulse response matrices, which tells us the
minimum number of terms we require to extract the Markov parameters. Also the
$\rho[\mathcal{H}(N)] = N_x$, which is the dimension of the minimal realization. If we did not know
the dimension of the system, then we would let (in theory) $N \to \infty$ and determine
the rank of $\mathcal{H}(\infty)$. Therefore, the minimal dimension of an “unknown” system is the
rank of the Hankel matrix. In order for a system to be minimal it must be completely
controllable and completely observable. This can be seen from the fact that the Hankel
matrix factors as:

$$ \mathcal{H}(N) = \begin{bmatrix} CB & \cdots & CA^{N-1}B \\ \vdots & & \vdots \\ CA^{N-1}B & \cdots & CA^{2N-2}B \end{bmatrix} = \begin{bmatrix} C \\ \vdots \\ CA^{N-1} \end{bmatrix} \begin{bmatrix} B \,|\, \cdots \,|\, A^{N-1}B \end{bmatrix} \tag{4.54} $$

or more simply

$$ \mathcal{H}(N) = \mathcal{O}(N) \, \mathcal{C}(N) \tag{4.55} $$

From this factorization it follows that the $\rho[\mathcal{H}(N)] = \min[\rho(\mathcal{O}(N)), \rho(\mathcal{C}(N))] = N_x$.
Therefore, we see that the properties of controllability and observability are carefully
woven into that of minimality, and testing the rank of the Hankel matrix yields
the dimensionality of the underlying dynamic system. This fact will prove crucial
when we must “identify” a system, $\Sigma = \{A, B, C\}$, from noisy measurement data. For
instance, many of the classical realization techniques [19, 20] rely on the factorization
of Eq. 4.55 to extract the system or process matrix, that is,

$$ \mathcal{O}(N) \times A = \begin{bmatrix} C \\ \vdots \\ CA^{N-2} \\ CA^{N-1} \end{bmatrix} A = \begin{bmatrix} CA \\ \vdots \\ CA^{N-1} \\ CA^{N} \end{bmatrix} =: \mathcal{O}^{\uparrow}(N) \tag{4.56} $$

Solving for A using the pseudo-inverse [18] gives

$$ A = \mathcal{O}^{\#}(N) \, \mathcal{O}^{\uparrow}(N) \quad \text{where } \mathcal{O}^{\#}(N) := (\mathcal{O}'(N) \mathcal{O}(N))^{-1} \mathcal{O}'(N) \tag{4.57} $$
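
The factorization of Eq. 4.55 and the shift relation of Eqs. 4.56 and 4.57 can be verified numerically on a small known system. The sketch below is only an illustration under stated assumptions: the helper names, the two-state example, and the choice N = 3 are hypothetical, and the observability matrix is built directly from the known (A, C) rather than estimated from data.

```python
import numpy as np

def markov_parameters(A, B, C, count):
    """H(k) = C A^{k-1} B for k = 1, ..., count (stored zero-based)."""
    out, AkB = [], B
    for _ in range(count):
        out.append(C @ AkB)
        AkB = A @ AkB
    return out

def block_hankel(H, N):
    """N x N block Hankel matrix; zero-based block (i, j) holds H(i + j + 1)."""
    return np.block([[H[i + j] for j in range(N)] for i in range(N)])

# Hypothetical minimal (controllable and observable) two-state system.
A = np.array([[0.5, 1.0], [0.0, 0.25]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
N = 3

H = markov_parameters(A, B, C, 2 * N - 1)
hankel = block_hankel(H, N)
O = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(N)])             # O(N)
Ctrl = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(N)])          # C(N)
O_up = np.vstack([C @ np.linalg.matrix_power(A, i) for i in range(1, N + 1)])   # O(N) A
A_rec = np.linalg.pinv(O) @ O_up          # pseudo-inverse solution, Eq. 4.57
rank = np.linalg.matrix_rank(hankel)      # equals N_x for a minimal system
```

Because the example system is minimal, the Hankel rank equals the state dimension and the pseudo-inverse recovers A exactly up to rounding; with noisy impulse response data neither would hold exactly.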

An efficient realization technique for a scalar (single input/single output) system 
can be obtained by performing a singular value decomposition (SVD ) of the Hankel 
matrix constructed from the impulse response sequence, H(k)\ k— 1,..., N, that is, 

U(N) = U x A x V with p(U(N)) = N x (4.58) 

where A = diagfA.,]; i = I, N*; N*:= N x N y N u , [A.,] is the set of singular values, 
and U and V the corresponding left and right singular vector matrices [18]. A variety 
of techniques can be used to estimate the rank of a matrix, perhaps the simplest 
(given the SVD) being the best rank approximation a (percentage) based on the ratio 
of singular values, 


$$\alpha(N_x) := \left(\frac{\sum_{i=1}^{N_x} \lambda_i}{\sum_{i=1}^{N^*} \lambda_i}\right) \times 100 \tag{4.59}$$
Here $\alpha$ (%) is the percentage of the original matrix (Hankel) approximated by choosing
$N_x$. A threshold, $\tau_x$ (%), can be selected to search over $i$ for that particular $N_x$ that
"best" approximates $\mathcal{H}$ up to the threshold, that is,

$$\alpha_n(N_x) \ge \tau_x(\%) \quad \text{for} \quad n = 1, \ldots, N_x \tag{4.60}$$


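Once the rank is estimated, then

$$\mathcal{H}(N) = U \Lambda V = U \begin{bmatrix} \tilde{\Lambda} & 0 \\ 0 & 0 \end{bmatrix} V = \tilde{U} \tilde{\Lambda} \tilde{V} \tag{4.61}$$

for $\tilde{U} \in \mathcal{R}^{N_y N \times N_x}$, $\tilde{\Lambda} \in \mathcal{R}^{N_x \times N_x}$, $\tilde{V} \in \mathcal{R}^{N_x \times N_u N}$.

When the decomposition and rank ($N_x$) are determined, the system triple (scalar),
$\Sigma = \{A, b, c\}$, is identified by [21]

$$A = \tilde{\Lambda}^{-1/2}\, \tilde{U}' \tilde{U}^{\uparrow}\, \tilde{\Lambda}^{1/2}; \quad b = \tilde{\Lambda}^{1/2} \tilde{V}; \quad \text{and} \quad c = \tilde{U} \tilde{\Lambda}^{1/2} \tag{4.62}$$

where (as before) $\tilde{U}^{\uparrow}$ is the eigenvector matrix, $\tilde{U}$ shifted up one row with a row of
zeros ($0'$) appended [21]. Here $\tilde{\Lambda}^{1/2}(\tilde{\Lambda}^{1/2})'$ are the square-root matrices of $\tilde{\Lambda}$. It has
been shown that this method results in a stable realization, that is, the eigenvalues
of $A$ satisfy $|\lambda(A)| < 1$.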


Example 4.2 


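Consider the following scalar example with impulse response, $H(k) = \{1, 1/2, 1/4,
1/8, 1/16\}$ with $N = 5$. Using the SVD approach, we would like to extract the
realization $\Sigma = \{A, b, c\}$. Creating the Hankel matrix and performing the SVD, we
obtain

$$\mathcal{H}(5) = \begin{bmatrix} 1 & 1/2 & 1/4 \\ 1/2 & 1/4 & 1/8 \\ 1/4 & 1/8 & 1/16 \end{bmatrix} = U \Lambda V$$

$\Lambda = \text{diag}[1.31, 0, 0]$ yielding a rank, $N_x = 1$; the singular vectors are

$$U = \begin{bmatrix} -0.8729 & -0.4364 & -0.2182 \\ -0.4364 & 0.8983 & -0.0509 \\ -0.2182 & -0.0509 & 0.9746 \end{bmatrix}; \quad V = \begin{bmatrix} -0.8729 & -0.4364 & -0.2182 \\ 0.4880 & -0.7807 & -0.3904 \\ 0 & -0.4472 & 0.8944 \end{bmatrix}$$

Thus the best rank approximants are: $\tilde{\Lambda} = 1.3125$ and

$$\tilde{U} = \begin{bmatrix} -0.8729 \\ -0.4364 \\ -0.2182 \end{bmatrix}; \quad \tilde{U}^{\uparrow} = \begin{bmatrix} -0.4364 \\ -0.2182 \\ 0 \end{bmatrix}; \quad \tilde{V} = \begin{bmatrix} -0.8729 & -0.4364 & -0.2182 \end{bmatrix}$$

Therefore, we obtain the realizations:

$$A = \tilde{\Lambda}^{-1/2}\, \tilde{U}' \tilde{U}^{\uparrow}\, \tilde{\Lambda}^{1/2} = (0.873) \begin{bmatrix} -0.8729 & -0.4364 & -0.2182 \end{bmatrix} \begin{bmatrix} -0.4364 \\ -0.2182 \\ 0 \end{bmatrix} (1.1456) = 0.48 \approx 1/2$$

$$b = \tilde{\Lambda}^{1/2} \tilde{V} = [1\ 0\ 0]; \quad c = \tilde{U} \tilde{\Lambda}^{1/2} = [-1\ 0\ 0]$$

△△△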

This completes the subsection on discrete systems theory. It should be noted that 
all of the properties discussed in this section exist for continuous-time systems (see 
[2] for details). 


4.5 GAUSS-MARKOV STATE-SPACE MODELS 

In this section we extend the state-space representation to incorporate random 
inputs or noise sources along with random initial conditions. We briefly discuss the 
continuous-time representation evolving to the sampled-data model and then provide 
a detailed discussion of the discrete-time Gauss-Markov model which will be used 
extensively throughout this text. 

4.5.1 Continuous-Time/Sampled-Data Gauss-Markov Models 

We start by defining the continuous-time Gauss-Markov model. If we constrain the 
state-space representation to be linear in the states, then we obtain the generic 
continuous-time, linear time-varying Gauss-Markov state-space model given by 


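$$\dot{x}_t = A_t x_t + B_t u_t + W_t w_t$$

$$y_t = C_t x_t + v_t \tag{4.63}$$

where $x_t \in \mathcal{R}^{N_x \times 1}$, $u_t \in \mathcal{R}^{N_u \times 1}$, $w_t \in \mathcal{R}^{N_w \times 1}$, $y_t \in \mathcal{R}^{N_y \times 1}$, $v_t \in \mathcal{R}^{N_y \times 1}$ are the
continuous-time state, input (deterministic), process noise, measurement and measurement
noise^4 vectors with corresponding system (process), input, process noise and
output matrices: $A \in \mathcal{R}^{N_x \times N_x}$, $B \in \mathcal{R}^{N_x \times N_u}$, $W \in \mathcal{R}^{N_x \times N_w}$ and $C \in \mathcal{R}^{N_y \times N_x}$.

The continuous-time Gaussian stochastic processes are $w_t \sim \mathcal{N}(0, R_{w_c w_c}(t))$ and
$v_t \sim \mathcal{N}(0, R_{v_c v_c}(t))$ with initial state defined by $x_0 \sim \mathcal{N}(\bar{x}_0, P_0)$. The corresponding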


4 Note that process and measurement noise sources are different. The process noise term is used primarily 
to model uncertainties in the input, state or even possibly unknown model parameters and it is “filtered” or 
colored (correlated) by the system dynamics, while measurement noise is uncorrelated and used to model 
instrumentation or extraneous environmental noise. 
statistics of this model follow with the dynamic mean derived directly from Eq. 4.63
(see [6-8] for details) as

$$\dot{m}_{x_t} = A_t m_{x_t} + B_t u_t \tag{4.64}$$

and the variance (continuous-time Lyapunov equation)

$$\dot{P}_t = A_t P_t + P_t A_t' + W_t R_{w_c w_c}(t) W_t' \tag{4.65}$$

with corresponding covariance

$$P_{t,\tau} = \begin{cases} \Phi_c(t,\tau) P_{\tau,\tau} & \text{for } t \ge \tau \\ P_{t,t}\, \Phi_c'(t,\tau) & \text{for } t \le \tau \end{cases} \tag{4.66}$$

where $\Phi_c(t,\tau)$ is the continuous state transition matrix.

As before, for the deterministic case, the stochastic sampled-data system follows 
from the continuous-time state solution 


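$$x_t = \Phi_c(t,t_0)\, x_{t_0} + \int_{t_0}^{t} \Phi_c(t,\tau) B_\tau u_\tau \, d\tau + \int_{t_0}^{t} \Phi_c(t,\tau) W_\tau w_\tau \, d\tau \tag{4.67}$$

Sampling this system with $t \to t_k$ over the interval $(t_{k-1}, t_k]$, the sampled
solution becomes (with notational change)^5

$$x(t_k) = \Phi(t_k,t_{k-1})\, x(t_{k-1}) + \int_{t_{k-1}}^{t_k} \Phi(t_k,\tau) B_\tau u_\tau \, d\tau + \int_{t_{k-1}}^{t_k} \Phi(t_k,\tau) W_\tau w_\tau \, d\tau \tag{4.68}$$

If we further assume that the input excitation is piecewise constant ($u_\tau \to u(t_{k-1})$)
over the interval $(t_{k-1}, t_k]$, then it can be removed from under the superposition integral
in Eq. 4.68 to give

$$x(t_k) = \Phi(t_k,t_{k-1})\, x(t_{k-1}) + \left( \int_{t_{k-1}}^{t_k} \Phi(t_k,\tau) B_\tau \, d\tau \right) \times u(t_{k-1}) + \int_{t_{k-1}}^{t_k} \Phi(t_k,\tau) W_\tau w_\tau \, d\tau \tag{4.69}$$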

We define the sampled-data input transmission matrix as

$$B(t_{k-1}) := \int_{t_{k-1}}^{t_k} \Phi(t_k,\tau) B_\tau \, d\tau \tag{4.70}$$

with the sampled-data process noise covariance matrix given by

$$R_{ww}(t_{k-1}) := \int_{t_{k-1}}^{t_k} \Phi(t_k,\tau) W_\tau R_{w_c w_c}(\tau) W_\tau' \Phi'(t_k,\tau) \, d\tau \tag{4.71}$$


5 We note in passing that this solution is conceptual and must actually follow a much more rigorous 
framework embedded in stochastic integrals and beyond the scope of this text (see [6-8] for details). 
and therefore the sampled-data state-space system with equally or unequally
sampled data is given by:

$$x(t_k) = \Phi(t_k,t_{k-1})\, x(t_{k-1}) + B(t_{k-1}) u(t_{k-1}) + W(t_{k-1}) w(t_{k-1})$$

$$y(t_k) = C(t_k) x(t_k) + v(t_k) \tag{4.72}$$

for $w_{t_k} \sim \mathcal{N}(0, R_{ww}(t_k))$ and $v_{t_k} \sim \mathcal{N}(0, R_{vv}(t_k))$ with initial state defined by
$x(t_0) \sim \mathcal{N}(\bar{x}(t_0), P(t_0))$. Recall from the deterministic solution that if we use a first
order Taylor series approximation we have that:

$$\Phi(t_k,t_{k-1}) \approx I + \Delta t_k \times A(t_{k-1})$$

$$B(t_{k-1}) \approx \Delta t_k \times B_{t_{k-1}}$$

$$R_{ww}(t_{k-1}) \approx \Delta t_k \times R_{w_c w_c}(t_{k-1})$$

$$R_{vv}(t_k) \approx R_{v_c v_c}(t_k) / \Delta t_k \tag{4.73}$$
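The first-order relations of Eq. 4.73 are easy to try on a scalar, time-invariant model. The sketch below is only an illustration: the function name `first_order_sampled_model` and all numerical values are assumptions, and the exact transition $e^{a \Delta t}$ is included solely to show the size of the approximation error.

```python
import math

def first_order_sampled_model(a, b, r_wc, r_vc, dt):
    """First-order (Taylor) sampled-data quantities for a scalar model."""
    phi = 1.0 + dt * a      # Phi(t_k, t_{k-1}) ~ I + dt * A
    b_d = dt * b            # B(t_{k-1})        ~ dt * B
    r_ww = dt * r_wc        # R_ww(t_{k-1})     ~ dt * R_wcwc
    r_vv = r_vc / dt        # R_vv(t_k)         ~ R_vcvc / dt
    return phi, b_d, r_ww, r_vv

# Assumed illustrative values (not from the text)
a, b, r_wc, r_vc, dt = -2.0, 1.0, 0.5, 0.1, 0.01
phi, b_d, r_ww, r_vv = first_order_sampled_model(a, b, r_wc, r_vc, dt)

# For a constant scalar a the exact transition matrix is exp(a * dt)
phi_exact = math.exp(a * dt)
print(phi, phi_exact, abs(phi - phi_exact))
```

The error in the transition term is of order $(a\,\Delta t)^2$, so the approximation is reasonable only when the sampling interval is short relative to the system time constant.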

The corresponding mean and covariance of the sampled-data process are: 


$$m_x(t_k) = A(t_{k-1})\, m_x(t_{k-1}) + B(t_{k-1})\, u(t_{k-1}) \tag{4.74}$$

and the measurement mean vector $m_y$ as

$$m_y(t_k) = C(t_k)\, m_x(t_k) \tag{4.75}$$

The state variance is given by

$$P(t_k) = A(t_{k-1}) P(t_{k-1}) A'(t_{k-1}) + W(t_{k-1}) R_{ww}(t_{k-1}) W'(t_{k-1}) \tag{4.76}$$

and the measurement variance is

$$R_{yy}(t_k) = C(t_k) P(t_k) C'(t_k) + R_{vv}(t_k) \tag{4.77}$$

This completes the development of the sampled-data Gauss-Markov representation
evolving from a continuous-time stochastic process. Next we consider the more
pragmatic discrete-time model.


4.5.2 Discrete-Time Gauss-Markov Models 

Here we investigate the case when random inputs are applied to a discrete state-space 
system with random initial conditions. If the excitation is a random signal, then the 
state is also random. Restricting the input to be deterministic $u(t-1)$ and the noise to
be zero-mean, white, random Gaussian $w(t-1)$, the Gauss-Markov model evolves as

$$x(t) = A(t-1)x(t-1) + B(t-1)u(t-1) + W(t-1)w(t-1) \tag{4.78}$$

where $w \sim \mathcal{N}(0, R_{ww}(t-1))$ and $x(0) \sim \mathcal{N}(\bar{x}(0), P(0))$.

The solution to the Gauss-Markov equations can easily be obtained by induction
to give

$$x(t) = \Phi(t,k)x(k) + \sum_{i=k}^{t-1} \Phi(t,i+1)B(i)u(i) + \sum_{i=k}^{t-1} \Phi(t,i+1)W(i)w(i) \tag{4.79}$$

which is first-order Markov depending only on the previous state. Since $x(t)$ is just
a linear transformation of Gaussian processes, it is also Gaussian. Thus, we can
represent a Gauss-Markov process easily using the state-space models.

When the measurement model is also included, we have

$$y(t) = C(t)x(t) + v(t) \tag{4.80}$$

where $v \sim \mathcal{N}(0, R_{vv}(t))$. The model is shown diagrammatically in Fig. 4.2.

Since the Gauss-Markov model of Eq. 4.78 is characterized by a Gaussian distribution,
it is completely specified statistically by its mean and variance. Therefore, if
we take the expectation of Eqs. 4.78 and 4.80, respectively, we obtain the state mean
vector $m_x$ as

$$m_x(t) = A(t-1)m_x(t-1) + B(t-1)u(t-1) \tag{4.81}$$

and the measurement mean vector $m_y$ as

$$m_y(t) = C(t)m_x(t) \tag{4.82}$$



FIGURE 4.2 Gauss-Markov model of a discrete process. 
The state variance^6 $P(t) := \text{var}\{x(t)\}$ is given by the discrete Lyapunov equation:

$$P(t) = A(t-1)P(t-1)A'(t-1) + W(t-1)R_{ww}(t-1)W'(t-1) \tag{4.83}$$

and the measurement variance, $R_{yy}(t) := \text{var}\{y(t)\}$, is

$$R_{yy}(t) = C(t)P(t)C'(t) + R_{vv}(t) \tag{4.84}$$

Similarly, it can be shown that the state covariance propagates according to the
following equations:

$$P(t,k) = \begin{cases} \Phi(t,k)P(k) & \text{for } t \ge k \\ P(t)\Phi'(t,k) & \text{for } t \le k \end{cases} \tag{4.85}$$


We summarize the Gauss-Markov and corresponding statistical models in 
Table 4.1. 


TABLE 4.1 Gauss-Markov Representation

State Propagation
$x(t) = A(t-1)x(t-1) + B(t-1)u(t-1) + W(t-1)w(t-1)$

State Mean Propagation
$m_x(t) = A(t-1)m_x(t-1) + B(t-1)u(t-1)$

State Variance/Covariance Propagation
$P(t) = A(t-1)P(t-1)A'(t-1) + W(t-1)R_{ww}(t-1)W'(t-1)$
$P(t,k) = \Phi(t,k)P(k)$ for $t \ge k$
$P(t,k) = P(t)\Phi'(t,k)$ for $t \le k$

Measurement Propagation
$y(t) = C(t)x(t) + v(t)$

Measurement Mean Propagation
$m_y(t) = C(t)m_x(t)$

Measurement Variance/Covariance Propagation
$R_{yy}(t) = C(t)P(t)C'(t) + R_{vv}(t)$
$R_{yy}(t,k) = C(t)P(t)C'(t) + R_{vv}(t,k)$
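The mean and variance recursions summarized in Table 4.1 can be sketched in a few lines. The following is a minimal illustration for a time-invariant two-state model using plain Python lists; the helper names (`matmul`, `transpose`, `add`, `propagate`) and all matrix values are assumptions made for this example.

```python
def matmul(x, y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]

def transpose(x):
    return [list(row) for row in zip(*x)]

def add(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def propagate(A, B, W, C, Rww, Rvv, m, P, u):
    """One step of the state/measurement mean and variance recursions."""
    m_next = add(matmul(A, m), matmul(B, u))                    # Eq. 4.81
    P_next = add(matmul(matmul(A, P), transpose(A)),
                 matmul(matmul(W, Rww), transpose(W)))          # Eq. 4.83
    m_y = matmul(C, m_next)                                     # Eq. 4.82
    R_yy = add(matmul(matmul(C, P_next), transpose(C)), Rvv)    # Eq. 4.84
    return m_next, P_next, m_y, R_yy

# Assumed two-state, time-invariant example (illustrative values only)
A = [[0.9, 0.1], [0.0, 0.8]]
B = [[0.0], [1.0]]
W = [[1.0, 0.0], [0.0, 1.0]]
C = [[1.0, 0.0]]
Rww = [[0.1, 0.0], [0.0, 0.2]]
Rvv = [[0.5]]

m = [[1.0], [0.0]]            # m_x(0)
P = [[1.0, 0.0], [0.0, 1.0]]  # P(0)
u = [[1.0]]                   # constant deterministic input

for t in range(1, 4):
    m, P, m_y, R_yy = propagate(A, B, W, C, Rww, Rvv, m, P, u)
    print(t, m_y[0][0], R_yy[0][0])
```

Only the statistics are propagated here; no random samples are drawn, which is what makes these recursions useful for predicting confidence bounds before any data are simulated.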


6 We use the shorthand notation, $P(k) := P(k,k) = \text{cov}\{x(k),x(k)\} = \text{var}\{x(k)\}$, throughout this text.

If we restrict the Gauss-Markov model to the stationary case, then
$A(t) = A$, $B(t) = B$, $C(t) = C$, $W(t) = W$, $R_{ww}(t) = R_{ww}$, and $R_{vv}(t) = R_{vv}$
and the variance equations become

$$P(t) = AP(t-1)A' + WR_{ww}W'$$

and

$$R_{yy}(t) = CP(t)C' + R_{vv} \tag{4.86}$$

At steady-state ($t \to \infty$), we have

$$P(t) = P(t-1) = \cdots = P_{ss} := P$$

and therefore, the measurement covariance relations become

$$R_{yy}(0) = CPC' + R_{vv} \quad \text{for lag } k = 0 \tag{4.87}$$

By induction, it can be shown that

$$R_{yy}(k) = CA^{|k|}PC' \quad \text{for } k \ne 0 \tag{4.88}$$

The measurement power spectrum is easily obtained by taking the Z-transform of 
this equation to obtain 

$$S_{yy}(z) = CS_{xx}(z)C' + S_{vv}(z) \tag{4.89}$$

where

$$S_{xx}(z) = T(z)S_{ww}(z)T'(z^{-1}) \quad \text{for} \quad T(z) = (zI - A)^{-1}W$$

$$S_{ww}(z) = R_{ww} \quad \text{and} \quad S_{vv}(z) = R_{vv}$$

Thus, using $H(z) = CT(z)$, the spectrum is given by

$$S_{yy}(z) = H(z)R_{ww}H'(z^{-1}) + R_{vv} \tag{4.90}$$

So we see that the Gauss-Markov state-space model enables us to have a more general
representation of a multichannel stochastic signal. In fact, we are able to easily
handle the multichannel and nonstationary statistical cases within this framework.
Generalizations are also possible with the vector models, but the forms become quite
complicated and require some knowledge of multivariable systems theory and canonical
forms (see [1] for details). Before we leave this subject, let us consider a simple
input-output example with Gauss-Markov models.
Example 4.3

Consider the following difference equation driven by random (white) noise:

$$y(t) = -ay(t-1) + e(t-1)$$

The corresponding state-space representation is obtained as

$$x(t) = -ax(t-1) + w(t-1) \quad \text{and} \quad y(t) = x(t)$$

Taking Z-transforms (ignoring the randomness), we obtain the transfer function

$$H(z) = \frac{1}{1 - az^{-1}}$$

Using Eq. 4.83, the variance equation for the above model is

$$P(t) = a^2 P(t-1) + R_{ww}$$

Assume the process is stationary, then $P(t) = P$ for all $t$ and solving for $P$ it follows
that

$$P = \frac{R_{ww}}{1 - a^2}$$

Therefore,

$$R_{yy}(k) = CA^{|k|}PC' = \frac{a^{|k|} R_{ww}}{1 - a^2} \quad \text{and} \quad R_{yy}(0) = CPC' + R_{vv} = \frac{R_{ww}}{1 - a^2}$$

Choosing $R_{ww} = 1 - a^2$ gives $R_{yy}(k) = a^{|k|}$. Taking Z-transforms the discrete
power spectrum is given by

$$S_{yy}(z) = H(z)R_{ee}H'(z^{-1}) + R_{vv} = \frac{1}{1 - az^{-1}}\, R_{ww}\, \frac{1}{1 - az}$$

Therefore, we conclude that for stationary processes these models are equivalent.

Now if we assume a nonstationary process and let $a = -0.75$, $x(0) \sim \mathcal{N}(1, 2.3)$,
$w \sim \mathcal{N}(0, 1)$, and $v \sim \mathcal{N}(0, 4)$, then the Gauss-Markov model is given by

$$x(t) = 0.75x(t-1) + w(t-1) \quad \text{and} \quad y(t) = x(t) + v(t)$$

The corresponding statistics are given by the mean relations

$$m_x(t) = 0.75\, m_x(t-1), \quad m_x(0) = 1$$

$$m_y(t) = m_x(t), \quad m_y(0) = m_x(0)$$

and the variance equations

$$P(t) = 0.5625\, P(t-1) + 1 \quad \text{and} \quad R_{yy}(t) = P(t) + 4$$
FIGURE 4.3 Gauss-Markov simulation of first-order process.

We apply the simulator available in SSPACK_PC [13] to obtain a 100-sample realization
of the process. The results are shown in Fig. 4.3a through c. In a and b we
see the mean and simulated states with corresponding confidence interval about the
mean, that is,

$$[m_x(t) \pm 1.96\sqrt{P(t)}]$$

and

$$P = \frac{R_{ww}}{1 - a^2} = 2.286$$

Using the above confidence interval, we expect 95% of the samples to lie within
$(m_x \to 0) \pm 3.03\ (1.96 \times \sqrt{2.286})$. From the figure we see that only 2 of the 100 samples
exceed this bound, indicating a statistically acceptable simulation. We observe
similar results for the simulated measurements. The steady-state variance is given by

$$R_{yy} = P + R_{vv} = 2.286 + 4 = 6.286$$

Therefore, we expect 95% of the measurement samples to lie within $(m_y \to 0)
\pm 5.01\ (1.96 \times \sqrt{6.286})$ at steady-state. This completes the example. △△△
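A simulation in the spirit of Example 4.3 can be sketched with Python's standard `random` module standing in for the SSPACK_PC simulator, which is not used here. The seed and variable names are assumptions, so the particular realization (and the count of samples outside the bound) will differ from the figure. Evaluated numerically, the $1.96\sqrt{P}$ bound at steady-state is about 2.96 for the state.

```python
import math
import random

random.seed(0)  # assumed seed, only so that the run is repeatable

n_samples = 100
a_coef = 0.75        # x(t) = 0.75 x(t-1) + w(t-1)
r_ww, r_vv = 1.0, 4.0
m_x, p = 1.0, 2.3    # x(0) ~ N(1, 2.3)

x = random.gauss(m_x, math.sqrt(p))
outside = 0
for t in range(1, n_samples + 1):
    # Statistics: mean and variance recursions of the example
    m_x = a_coef * m_x
    p = a_coef**2 * p + r_ww
    # One realization of the state and the measurement
    x = a_coef * x + random.gauss(0.0, math.sqrt(r_ww))
    y = x + random.gauss(0.0, math.sqrt(r_vv))
    if abs(x - m_x) > 1.96 * math.sqrt(p):
        outside += 1

p_ss = r_ww / (1.0 - a_coef**2)   # steady-state state variance
r_yy_ss = p_ss + r_vv             # steady-state measurement variance
bound = 1.96 * math.sqrt(p_ss)    # steady-state confidence bound on the state
print(round(p_ss, 3), round(r_yy_ss, 3), round(bound, 3), outside)
```

For a statistically acceptable run, the number of state samples falling outside the interval should be roughly 5% of the total.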


4.6 INNOVATIONS MODEL 

In this subsection we briefly develop the innovations model which is related to the 
Gauss-Markov representation just discussed. The significance of this model will be 
developed throughout the text, but we take the opportunity now to show its relationship 
to the basic Gauss-Markov representation. We start by extending the original Gauss- 
Markov representation to the correlated process and measurement noise case and then 
showing how the innovations model is a special case of this structure. 

The standard Gauss-Markov model for correlated process and measurement noise 
is given by 


$$x(t) = Ax(t-1) + Bu(t-1) + W(t-1)w^*(t-1)$$

$$y(t) = Cx(t) + v^*(t) \tag{4.91}$$

where $R^*(t,k) := R^*\delta(t-k)$ and

$$R^* := \begin{bmatrix} R_{w^*w^*} & R_{w^*v^*} \\ R_{v^*w^*} & R_{v^*v^*} \end{bmatrix} = \begin{bmatrix} WR_{ww}W' & WR_{wv} \\ R_{vw}W' & R_{vv} \end{bmatrix}$$

Here we observe that in the correlated Gauss-Markov model, the $(N_x + N_v) \times
(N_x + N_v)$ block covariance matrix, $R^*$, is full with cross-covariance matrices $R_{w^*v^*}$
on its off-diagonals, whereas the usual (uncorrelated) model assumes that they are null. To
simulate a system with correlated $w(t)$ and $v(t)$ is more complicated using this form
of the Gauss-Markov model because $R^*$ must first be factored such that

$$R^* = \left[R^{*1/2}\right]\left[R^{*T/2}\right] \tag{4.92}$$

where $R^{*1/2}$ are matrix square roots [6,7]. Once the factorization is performed, then the
correlated noise is synthesized by "coloring" the uncorrelated noise sources, $w(t)$ and
$v(t)$, as

$$\begin{bmatrix} w^*(t) \\ v^*(t) \end{bmatrix} = \begin{bmatrix} R_w^{*1/2}\, w(t) \\ R_v^{*1/2}\, v(t) \end{bmatrix} \tag{4.93}$$
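One way to carry out the factor-then-color idea is sketched below for scalar $w^*$ and $v^*$. It is an illustration under stated assumptions: the Cholesky factor is used as the matrix square root (one valid choice among several), the full lower-triangular factor is applied to the stacked unit-variance noise vector rather than block by block, and the covariance values are invented for the example.

```python
import math
import random

def cholesky_2x2(r):
    """Lower-triangular L with L L' = r for a 2 x 2 positive definite r."""
    l11 = math.sqrt(r[0][0])
    l21 = r[1][0] / l11
    l22 = math.sqrt(r[1][1] - l21 * l21)
    return [[l11, 0.0], [l21, l22]]

# Assumed joint covariance of scalar w*(t) and v*(t) (illustrative values)
r_star = [[1.0, 0.6],
          [0.6, 4.0]]
L = cholesky_2x2(r_star)

random.seed(1)  # assumed seed for repeatability
n = 20000
samples = []
for _ in range(n):
    # Uncorrelated, unit-variance Gaussian sources
    w, v = random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)
    w_star = L[0][0] * w
    v_star = L[1][0] * w + L[1][1] * v
    samples.append((w_star, v_star))

# Sample cross-covariance, to be compared with the requested value 0.6
cross = sum(ws * vs for ws, vs in samples) / n
print(L, round(cross, 2))
```

The sample cross-covariance approaches the off-diagonal entry of the assumed $R^*$ as the number of samples grows.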
The innovations model is a constrained version of the correlated Gauss-Markov
characterization. If we assume that $\{e(t)\}$ is a zero-mean, white, Gaussian sequence,
that is, $e \sim \mathcal{N}(0, R_{ee})$, then the innovations model [9-13] evolves as:

$$x(t) = A(t-1)x(t-1) + B(t-1)u(t-1) + K(t-1)e(t-1)$$

$$y(t) = C(t)x(t) + D(t)u(t) + e(t) \tag{4.94}$$

where $e(t)$ is the $N_y$-dimensional innovations vector and $K(t-1)$ is the $(N_x \times N_y)$
weighting matrix,^7 with

$$R_{ee}^* := \text{cov}\left(\begin{bmatrix} Ke(t) \\ e(t) \end{bmatrix}\right) = \begin{bmatrix} KR_{ee}K' & KR_{ee} \\ R_{ee}K' & R_{ee} \end{bmatrix} \delta(t-k)$$

It is important to note that the innovations model has implications in Wiener-Kalman
filtering (spectral factorization) because $R_{ee}$ can be represented in factored
or square-root form ($R := \sqrt{R}\sqrt{R}$) directly in terms of the gain and innovation
covariance matrix, that is,

$$R_{ee}^* := \begin{bmatrix} K\sqrt{R_{ee}} \\ \sqrt{R_{ee}} \end{bmatrix} \begin{bmatrix} \sqrt{R_{ee}}K' & \sqrt{R_{ee}} \end{bmatrix} \delta(t-k)$$


Comparing the innovations model to the Gauss-Markov model, we see that they are 
both equivalent to the case when w and v are correlated. This completes the discussion 
of the innovations model. Next we show the equivalence of the various model sets to 
this family of state-space representations. 


4.7 STATE-SPACE MODEL STRUCTURES 

In this section we discuss special state-space structures usually called "canonical
forms" in the literature, since they represent unique state constructs that are particularly
useful. We will confine the models to single input/single output forms because
the multivariable structures are too complicated for this discussion [4]. Here we will
first investigate the most popular "time series" models and then their equivalent
representation in the state-space. We start with the ARMAX model and then progress to
the special cases of this generic structure.

4.7.1 Time Series Models 

Time series models are particularly useful representations used frequently by statisticians
and signal processors to represent time sequences when no physics is available


7 Actually K is called the Kalman gain matrix , which will be discussed in detail when we develop the 
model-based processor in a subsequent chapter. 
to use directly. They form the class of black box or gray box models [13] which
are used in predicting data. These models have an input-output structure, but they 
can be transformed to an equivalent state-space representation. Each model set has 
its own advantages: the input-output models are easy to use, while the state-space 
models are easily generalized and usually evolve when physical phenomenology can 
be described in a model-based sense [13]. 

The input-output or transfer function model is familiar to engineers and scientists
because it is usually presented in the frequency domain with Laplace transforms.
Similarly in the discrete-time case, it is called the pulse transfer function model and
is expressed as

$$H(z) = \frac{B(z^{-1})}{A(z^{-1})} \tag{4.96}$$

where $A$ and $B$ are polynomials in $z$ or $z^{-1}$,

$$A(z^{-1}) = 1 + a_1 z^{-1} + \cdots + a_{N_a} z^{-N_a} \tag{4.97}$$

$$B(z^{-1}) = b_0 + b_1 z^{-1} + \cdots + b_{N_b} z^{-N_b} \tag{4.98}$$

If we consider the equivalent time domain representation, then we have a difference
equation relating the output sequence $\{y(t)\}$ to the input sequence $\{u(t)\}$.^8 We use the
backward shift operator $q$ with the property that $q^{-k}y(t) = y(t-k)$.

$$A(q^{-1})y(t) = B(q^{-1})u(t) \tag{4.99}$$

$$y(t) + a_1 y(t-1) + \cdots + a_{N_a} y(t-N_a) = b_0 u(t) + \cdots + b_{N_b} u(t-N_b) \tag{4.100}$$

When the system is excited by random inputs, the models are given by the
autoregressive-moving average model with exogenous inputs (ARMAX)^9

$$\underbrace{A(q^{-1})y(t)}_{AR} = \underbrace{B(q^{-1})u(t)}_{X} + \underbrace{C(q^{-1})e(t)}_{MA} \tag{4.101}$$

where $A$, $B$, $C$ are polynomials, $\{e(t)\}$ is a white noise source, and

$$C(q^{-1}) = c_0 + c_1 q^{-1} + \cdots + c_{N_c} q^{-N_c}$$


8 We change from the common signal processing convention of using x(t) for the deterministic excitation 
to u(t) and we include the bo coefficient for generality. 

9 The ARMAX model can be interpreted in terms of the Wold decomposition of stationary time series,
which states that a time series can be decomposed into a predictable or deterministic component (u(t)) and 
nondeterministic or random component (e(t)) [10]. 
The ARMAX model, usually abbreviated by ARMAX($N_a, N_b, N_c$), represents the general
form for many popular time series and digital filter models. A summary of these
models follows:

• Pulse Transfer Function or Infinite Impulse Response (IIR) model: $C(\cdot) = 0$, or
ARMAX($N_a, N_b, 0$), that is,

$$A(q^{-1})y(t) = B(q^{-1})u(t)$$

• Finite Impulse Response (FIR) model: $A(\cdot) = 1$, $C(\cdot) = 0$, or ARMAX($1, N_b, 0$),
that is,

$$y(t) = B(q^{-1})u(t)$$

• Autoregressive (AR) model: $B(\cdot) = 0$, $C(\cdot) = 1$, or ARMAX($N_a, 0, 1$), that is,

$$A(q^{-1})y(t) = e(t)$$

• Moving Average (MA) model: $A(\cdot) = 1$, $B(\cdot) = 0$, or ARMAX($1, 0, N_c$), that is,

$$y(t) = C(q^{-1})e(t)$$

• Autoregressive-Moving Average (ARMA) model: $B(\cdot) = 0$, or ARMAX($N_a, 0, N_c$),
that is,

$$A(q^{-1})y(t) = C(q^{-1})e(t)$$

• Autoregressive model with Exogenous Input (ARX): $C(\cdot) = 1$, or ARMAX($N_a, N_b, 1$),
that is,

$$A(q^{-1})y(t) = B(q^{-1})u(t) + e(t)$$
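The special cases listed above all follow from one recursion, which can be sketched as a small function. The name `armax` and the zero initial conditions are assumptions of this illustration; driving the noise input with a unit impulse is done only to expose each model's impulse response.

```python
def armax(a, b, c, u, e):
    """Simulate A(q^-1) y(t) = B(q^-1) u(t) + C(q^-1) e(t).

    a, b, c are coefficient lists [a0, a1, ...] with a[0] == 1.
    Samples before t = 0 are taken as zero (an assumption of this sketch).
    """
    y = []
    for t in range(len(u)):
        acc = sum(b[i] * u[t - i] for i in range(len(b)) if t - i >= 0)
        acc += sum(c[i] * e[t - i] for i in range(len(c)) if t - i >= 0)
        acc -= sum(a[i] * y[t - i] for i in range(1, len(a)) if t - i >= 0)
        y.append(acc)
    return y

n = 6
impulse = [1.0] + [0.0] * (n - 1)
zeros = [0.0] * n

# FIR: A = 1, C = 0, so the impulse response is just the b coefficients
fir = armax([1.0], [1.0, 0.5, 0.25], [0.0], impulse, zeros)

# AR: B = 0, C = 1, A = 1 + 0.5 q^-1, driven here by a unit impulse in e
ar = armax([1.0, 0.5], [0.0], [1.0], zeros, impulse)

# MA: A = 1, B = 0, C = 1 + 0.5 q^-1
ma = armax([1.0], [0.0], [1.0, 0.5], zeros, impulse)

print(fir)
print(ar)
print(ma)
```

The FIR and MA responses terminate after the last coefficient, while the AR response decays geometrically, which is the finite versus infinite impulse response distinction drawn in the list.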

The ARMAX model is shown in Fig. 4.4. ARMAX models can easily be used 
for signal processing purposes, since they are basically digital filters with known 
deterministic (u(t)) and random (e(t)) excitations. 

Since the ARMAX model is used to characterize a random signal, we are interested 
in its statistical properties. The mean value of the output is easily determined by 

$$A(q^{-1})E\{y(t)\} = B(q^{-1})E\{u(t)\} + C(q^{-1})E\{e(t)\}$$

or

$$A(q^{-1})m_y(t) = B(q^{-1})u(t) + C(q^{-1})m_e(t) \tag{4.102}$$






FIGURE 4.4 ARMAX input-output model. 


Because the first term in the A-polynomial is unity, we can write the mean propagation
recursion for the ARMAX model as

$$m_y(t) = (1 - A(q^{-1}))m_y(t) + B(q^{-1})u(t) + C(q^{-1})m_e(t) \tag{4.103}$$

$$m_y(t) = -\sum_{i=1}^{N_a} a_i m_y(t-i) + \sum_{i=0}^{N_b} b_i u(t-i) + \sum_{i=0}^{N_c} c_i m_e(t-i) \tag{4.104}$$

We note that the mean of the ARMAX model is propagated using a recursive digital
filter requiring $N_a$, $N_b$, $N_c$ past input and output values.

The corresponding variance of the ARMAX model is more complex. First, we note
that the mean must be removed, that is,

$$y(t) - m_y(t) = [(1 - A(q^{-1}))y(t) + B(q^{-1})u(t) + C(q^{-1})e(t)] - [(1 - A(q^{-1}))m_y(t) + B(q^{-1})u(t) + C(q^{-1})m_e(t)] \tag{4.105}$$

or

$$y(t) - m_y(t) = (1 - A(q^{-1}))(y(t) - m_y(t)) + C(q^{-1})(e(t) - m_e(t))$$

or finally

$$A(q^{-1})(y(t) - m_y(t)) = C(q^{-1})(e(t) - m_e(t)) \tag{4.106}$$

that is, $y - m_y$ is characterized by an ARMAX($N_a, 0, N_c$) or equivalently an ARMA
model.

The covariance of the ARMAX model can be calculated utilizing the fact that it is
essentially an IIR system, that is,

$$\frac{Y(z)}{E(z)} = H(z) = \sum_{t=0}^{\infty} h(t) z^{-t} \tag{4.107}$$

Using this fact and the commutativity of the convolution operator, we have (assuming
the mean has been removed)

$$R_{yy}(k) = E\{y(t)y(t+k)\} = E\left\{ \sum_{i=0}^{\infty} h(i)e(t-i) \sum_{j=0}^{\infty} h(j+k)e(t-j+k) \right\}$$

or

$$R_{yy}(k) = \sum_{i=0}^{\infty} h(i)h(i+k)E\{e(t-i)e(t-i+k)\} + \sum\sum_{i \ne j} h(i)h(j+k)E\{e(t-i)e(t-j+k)\}$$

The whiteness of $\{e(t)\}$ gives

$$R_{ee}(k) = \begin{cases} R_{ee} & k = 0 \\ 0 & \text{elsewhere} \end{cases}$$

therefore, applying this property above we have the covariance of the ARMAX model
given by

$$R_{yy}(k) = R_{ee} \sum_{i=0}^{\infty} h(i)h(i+k) \quad \text{for } k \ge 0 \tag{4.108}$$

with corresponding variance

$$R_{yy}(0) = R_{ee} \sum_{i=0}^{\infty} h^2(i) \tag{4.109}$$

We note that one property of a stationary signal is that its impulse response is bounded
which implies from Eq. 4.108 that the variance is bounded [14]. Clearly, since the
variance is characterized by an ARMA model (ARMAX($N_a, 0, N_c$)), then we have

$$A(z^{-1})H(z) = C(z^{-1})E(z), \quad E(z) = \sqrt{R_{ee}}$$
TABLE 4.2 ARMAX Representation

Output Propagation
$y(t) = (1 - A(q^{-1}))y(t) + B(q^{-1})u(t) + C(q^{-1})e(t)$

Mean Propagation
$m_y(t) = (1 - A(q^{-1}))m_y(t) + B(q^{-1})u(t) + C(q^{-1})m_e(t)$

Impulse Propagation
$h(t) = (1 - A(q^{-1}))h(t) + C(q^{-1})\delta(t)$

Variance/Covariance Propagation
$R_{yy}(k) = R_{ee} \sum_{i=0}^{\infty} h(i)h(i+k), \quad k \ge 0$

$y$ = output or measurement sequence
$u$ = input sequence
$e$ = process (white) noise sequence with variance $R_{ee}$
$h$ = impulse response sequence
$\delta$ = impulse input of amplitude $\sqrt{R_{ee}}$
$m_y$ = mean output or measurement sequence
$m_e$ = mean process noise sequence
$R_{yy}$ = stationary output covariance at lag $k$
$A$ = $N_a$-th order system characteristic (poles) polynomial
$B$ = $N_b$-th order input (zeros) polynomial
$C$ = $N_c$-th order noise (zeros) polynomial

or taking the inverse Z-transform

$$h(t) = (1 - A(q^{-1}))h(t) + C(q^{-1})\delta(t)$$

$$h(t) = -\sum_{i=1}^{N_a} a_i h(t-i) + \sum_{i=0}^{N_c} c_i \delta(t-i), \quad c_0 = 1 \tag{4.110}$$

where $\delta(t)$ is an impulse of weight $\sqrt{R_{ee}}$. So we see that this recursion coupled with
Eq. 4.108 provides a method for calculating the variance of an ARMAX model. We
summarize these results in Table 4.2 and the following example.

Example 4.4 

Consider the difference equation of the previous example with $a = 0.5$ which is an AR
model with $A(z) = 1 + 0.5z^{-1}$, and $R_{ee} = 1$. We would like to calculate the variance.
From Eq. 4.110, we have

$$h(t) = -0.5h(t-1) + \delta(t) \longrightarrow (-0.5)^t$$

and

$$R_{yy}(0) = \sum_{i=0}^{\infty} h^2(i) = [1^2 + (-0.5)^2 + (0.25)^2 + (-0.125)^2 + (0.0625)^2 + \cdots] \to 1.333$$

△△△
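The sum in Example 4.4 can be checked numerically with a few lines. This is a minimal sketch: the impulse is taken with unit weight since $R_{ee} = 1$, and truncating the infinite sum at 60 terms is a choice made for the illustration.

```python
# Impulse response recursion h(t) = -0.5 h(t-1) + delta(t), unit impulse, R_ee = 1
r_ee = 1.0
h_prev = 0.0
variance = 0.0
for t in range(60):               # 60 terms: truncation chosen for this sketch
    delta = 1.0 if t == 0 else 0.0
    h_t = -0.5 * h_prev + delta
    variance += r_ee * h_t * h_t  # accumulate R_ee * h^2(t), Eq. 4.109
    h_prev = h_t

print(round(variance, 3))
```

The accumulated value agrees with the closed form $1/(1 - 0.25) = 4/3$ for this geometric series.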

Let us consider a more complex example to illustrate the use of the ARMA model. 

Example 4.5 

Suppose we would like to investigate the structure of an ARMAX(2, 1, 1) model with
the following difference equation and calculate its underlying mean and variance
propagation relations

$$(1 + 3/4\, q^{-1} + 1/8\, q^{-2})y(t) = (1 + 1/8\, q^{-1})u(t) + (1 + 1/16\, q^{-1})e(t)$$

where $u(t) = \sin 2\pi(0.025)t$ and $e \sim \mathcal{N}(1, 0.01)$. Then the corresponding mean
propagation equation is

$$(1 + 3/4\, q^{-1} + 1/8\, q^{-2})m_y(t) = (1 + 1/8\, q^{-1})u(t) + (1 + 1/16\, q^{-1})m_e(t)$$

for $m_e(t) = 1$ for all $t$. The impulse propagation model is

$$(1 + 3/4\, q^{-1} + 1/8\, q^{-2})h(t) = (1 + 1/16\, q^{-1})\sqrt{R_{ee}}\,\delta(t) \quad \text{for } \sqrt{R_{ee}} = 0.1$$

and the variance/covariance propagation model is:

$$R_{yy}(k) = R_{ee} \sum_{t=0}^{\infty} h(t)h(t+k), \quad k \ge 0$$

This completes the example. △△△
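The propagation relations of Example 4.5 can be evaluated numerically as follows. This is a sketch under stated assumptions: zero initial conditions, a 200-sample horizon, and a unit-weight impulse with $R_{ee}$ applied once in the covariance sum (rather than also scaling the impulse by $\sqrt{R_{ee}}$). The helper names `recursion` and `apply_poly` are hypothetical.

```python
import math

a = [1.0, 0.75, 0.125]   # A(q^-1) = 1 + 3/4 q^-1 + 1/8 q^-2
b = [1.0, 0.125]         # B(q^-1) = 1 + 1/8 q^-1
c = [1.0, 0.0625]        # C(q^-1) = 1 + 1/16 q^-1
r_ee, m_e = 0.01, 1.0    # e ~ N(1, 0.01)

def recursion(a, forcing):
    """Solve A(q^-1) out(t) = forcing(t) with zero initial conditions."""
    out = []
    for t, f in enumerate(forcing):
        past = sum(a[i] * out[t - i] for i in range(1, len(a)) if t - i >= 0)
        out.append(f - past)
    return out

def apply_poly(p, seq):
    """Apply a polynomial in q^-1 to a sequence (zero before t = 0)."""
    return [sum(p[i] * seq[t - i] for i in range(len(p)) if t - i >= 0)
            for t in range(len(seq))]

n = 200
u = [math.sin(2.0 * math.pi * 0.025 * t) for t in range(n)]
delta = [1.0] + [0.0] * (n - 1)    # unit impulse

# Mean propagation: A m_y = B u + C m_e
forcing = [x + y for x, y in zip(apply_poly(b, u), apply_poly(c, [m_e] * n))]
m_y = recursion(a, forcing)

# Impulse response with a unit impulse; R_ee is applied once in the sum below
h = recursion(a, apply_poly(c, delta))

def r_yy(k):
    return r_ee * sum(h[t] * h[t + k] for t in range(n - k))

print(h[:3], r_yy(0), r_yy(1))
```

Because the poles of $A$ lie at $-1/2$ and $-1/4$, the impulse response decays quickly and the truncated sum is a close stand-in for the infinite one.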

It should be noted that for certain special cases of the ARMAX model, it is particularly
simple to calculate the mean and covariance. For instance, the MA model
(ARMAX($1, 0, N_c$)) has mean

$$m_y(t) = E\{C(q^{-1})e(t)\} = C(q^{-1})m_e(t) \tag{4.111}$$

and covariance (directly from Eq. 4.108 with $h \to c$)

$$R_{yy}(k) = R_{ee} \sum_{i=0}^{N_c} c_i c_{i+k} \quad \text{for } k \ge 0 \tag{4.112}$$
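Eq. 4.112 translates directly into code. The sketch below assumes an MA(1) model with coefficients chosen for illustration and treats $c_i = 0$ for $i > N_c$; the function name `ma_covariance` is hypothetical.

```python
def ma_covariance(c, r_ee, k):
    """R_yy(k) = R_ee * sum_i c_i * c_{i+k} for an MA model and lag k >= 0."""
    if k >= len(c):
        return 0.0            # no overlapping coefficients beyond lag N_c
    return r_ee * sum(c[i] * c[i + k] for i in range(len(c) - k))

c = [1.0, 0.5]   # assumed MA(1): y(t) = e(t) + 0.5 e(t-1)
print([ma_covariance(c, 1.0, k) for k in range(4)])
```

The covariance vanishes for lags beyond $N_c$, reflecting the finite memory of the MA model.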
Another special case of interest is the AR (ARMAX($N_a, 0, 1$)) model with mean

$$m_y(t) = (1 - A(q^{-1}))m_y(t) + m_e(t) \tag{4.113}$$

and covariance which is easily derived by direct substitution

$$R_{yy}(k) = E\{y(t)y(t+k)\} = (1 - A(q^{-1}))R_{yy}(k) = -\sum_{i=1}^{N_a} a_i R_{yy}(k-i) \quad \text{for } k > 0 \tag{4.114}$$

In fact, the AR covariance model of Eq. 4.114 is essentially a recursive (all-pole)
digital filter which can be propagated by exciting it with the variance $R_{yy}(0)$ as initial
condition. In this case the variance is given by

$$R_{yy}(0) = E\{y^2(t)\} = E\left\{ \left( -\sum_{i=1}^{N_a} a_i y(t-i) + e(t) \right) y(t) \right\} = -\sum_{i=1}^{N_a} a_i R_{yy}(i) + R_{ee} \tag{4.115}$$

So combining Eqs. (4.114, 4.115), we have that the covariance propagation equations
for the AR model are given by

$$R_{yy}(k) = \begin{cases} -\sum_{i=1}^{N_a} a_i R_{yy}(i) + R_{ee} & k = 0 \\ -\sum_{i=1}^{N_a} a_i R_{yy}(k-i) & k > 0 \end{cases}$$

Consider the following example of calculating the statistics of the AR model. 

Example 4.6 

Consider the AR model of the previous two examples. We would like to determine
the corresponding mean and variance using the recursions of Eqs. (4.113, 4.114) with
$A(q^{-1}) = 1 + 0.5q^{-1}$. The mean is

$$m_y(t) = -0.5\, m_y(t-1)$$

and the covariance is

$$R_{yy}(k) = \begin{cases} -0.5\, R_{yy}(1) + 1 & k = 0 \\ -0.5\, R_{yy}(k-1) & k > 0 \end{cases}$$

The variance is obtained directly from these recursions, since

$$R_{yy}(1) = -0.5\, R_{yy}(0)$$

and therefore

$$R_{yy}(0) = -0.5\, R_{yy}(1) + 1$$

Substituting for $R_{yy}(1)$, we obtain

$$R_{yy}(0) = -0.5(-0.5\, R_{yy}(0)) + 1$$

or

$$R_{yy}(0) = 1.333$$

as before. △△△
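The same substitution can be written out in code, which also gives the covariance at higher lags. This is a small sketch for the first-order case only; the variable names are illustrative.

```python
a1, r_ee = 0.5, 1.0

# k = 0 and k = 1 relations: R(0) = -a1 R(1) + R_ee and R(1) = -a1 R(0)
# Substituting the second into the first gives R(0) = R_ee / (1 - a1^2)
r0 = r_ee / (1.0 - a1**2)

# Propagate R_yy(k) = -a1 R_yy(k-1) for k > 0
r = [r0]
for k in range(1, 5):
    r.append(-a1 * r[k - 1])

print([round(val, 4) for val in r])
```

The variance matches the impulse-response sum of Example 4.4, and the covariance alternates in sign while decaying by a factor of one half per lag.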

This completes the section on ARMAX models. 

4.7.2 State-Space and Time Series Equivalence Models 

In this section we show the equivalence between the ARMAX and state-space models
(for scalar processes). That is, we show how to obtain the state-space model given the
ARMAX models by inspection. We choose particular coordinate systems in the state-space
(canonical forms) and obtain a relationship between entries of the state-space
system to coefficients of the ARMAX model. An example is presented that shows how
these models can be applied to realize a random signal. First, we consider the ARMAX
to state-space transformation.

Recall from Eq. 4.101 that the general difference equation form of the ARMAX
model is given by

$$y(t) = -\sum_{i=1}^{N_a} a_i y(t-i) + \sum_{i=0}^{N_b} b_i u(t-i) + \sum_{i=0}^{N_c} c_i e(t-i) \tag{4.116}$$

or equivalently in the frequency domain as

$$Y(z) = \frac{B(z^{-1})}{A(z^{-1})}\, U(z) + \frac{C(z^{-1})}{A(z^{-1})}\, E(z) \tag{4.117}$$

where $N_a \ge N_b$ and $N_c$ and $\{e(t)\}$ is a zero-mean white sequence with spectrum given
by $R_{ee}$.

It is straightforward but tedious to show (see [5]) that the ARMAX model can be
represented in observer canonical form:

$$x(t) = A_0 x(t-1) + B_0 u(t-1) + W_0 e(t-1)$$

$$y(t) = C_0' x(t) + b_0 u(t) + c_0 e(t) \tag{4.118}$$
where $x$, $u$, $e$, and $y$ are the $N_a$-state vector, scalar input, noise, and output with

$$A_0 := \left[\begin{array}{c|c} 0 & -a_{N_a} \\ \hline I_{N_a - 1} & \begin{matrix} \vdots \\ -a_1 \end{matrix} \end{array}\right]; \quad B_0 := \begin{bmatrix} -a_{N_a} b_0 \\ \vdots \\ -a_{N_b+1} b_0 \\ b_{N_b} - a_{N_b} b_0 \\ \vdots \\ b_1 - a_1 b_0 \end{bmatrix}; \quad W_0 := \begin{bmatrix} -a_{N_a} c_0 \\ \vdots \\ -a_{N_c+1} c_0 \\ c_{N_c} - a_{N_c} c_0 \\ \vdots \\ c_1 - a_1 c_0 \end{bmatrix}$$

$$C_0' := [0 \ \cdots \ 0 \ 1]$$

Noting this structure we see that each of the matrix or vector elements
$\{A_{i,N_a}, B_i, W_i, C_i\}$, $i = 1, \ldots, N_a$ can be determined from the relations

$$A_{i,N_a} = -a_i, \quad i = 1, \ldots, N_a$$

$$B_i = b_i - a_i b_0$$

$$W_i = c_i - a_i c_0$$

$$C_i = \delta(N_a - i) \tag{4.119}$$

where

$b_i = 0$ for $i > N_b$
$c_i = 0$ for $i > N_c$
$\delta(i - j)$ is the Kronecker delta

Consider the following example to illustrate these relations.


Example 4.7 

Let $N_a = 3$, $N_b = 2$, and $N_c = 1$; then the corresponding ARMAX model is

$$y(t) = -a_1 y(t-1) - a_2 y(t-2) - a_3 y(t-3) + b_0 u(t) + b_1 u(t-1) + b_2 u(t-2) + c_0 e(t) + c_1 e(t-1) \tag{4.120}$$

Using the observer canonical form of Eq. 4.118 we have

$$x(t) = \left[\begin{array}{cc|c} 0 & 0 & -a_3 \\ 1 & 0 & -a_2 \\ 0 & 1 & -a_1 \end{array}\right] x(t-1) + \begin{bmatrix} -a_3 b_0 \\ b_2 - a_2 b_0 \\ b_1 - a_1 b_0 \end{bmatrix} u(t-1) + \begin{bmatrix} -a_3 c_0 \\ -a_2 c_0 \\ c_1 - a_1 c_0 \end{bmatrix} e(t-1)$$

$$y(t) = [0 \ 0 \ 1]\, x(t) + b_0 u(t) + c_0 e(t)$$

This completes the example. △△△
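The equivalence claimed in Example 4.7 can be checked numerically by driving both forms with the same sequences. In this sketch the numerical coefficients, the seeds and the helper name `observer_form` are assumptions; zero initial conditions are used for both representations.

```python
import random

def observer_form(a, b, c):
    """Observer canonical form for N_a = 3 (a = [a1, a2, a3]),
    b = [b0, b1, b2, b3] and c = [c0, c1, c2, c3], zero-padded."""
    a1, a2, a3 = a
    A0 = [[0.0, 0.0, -a3],
          [1.0, 0.0, -a2],
          [0.0, 1.0, -a1]]
    B0 = [b[3] - a3 * b[0], b[2] - a2 * b[0], b[1] - a1 * b[0]]
    W0 = [c[3] - a3 * c[0], c[2] - a2 * c[0], c[1] - a1 * c[0]]
    return A0, B0, W0

# Assumed numerical coefficients for N_a = 3, N_b = 2, N_c = 1
a = [0.5, -0.2, 0.1]
b = [1.0, 0.4, 0.3, 0.0]    # b3 = 0 since i > N_b
c = [1.0, 0.25, 0.0, 0.0]   # c2 = c3 = 0 since i > N_c
A0, B0, W0 = observer_form(a, b, c)

random.seed(2)  # assumed seed for repeatability
n = 50
u = [random.uniform(-1.0, 1.0) for _ in range(n)]
e = [random.gauss(0.0, 1.0) for _ in range(n)]

# Direct evaluation of the difference equation (zero initial conditions)
y_direct = []
for t in range(n):
    val = sum(b[i] * u[t - i] for i in range(4) if t - i >= 0)
    val += sum(c[i] * e[t - i] for i in range(4) if t - i >= 0)
    val -= sum(a[i - 1] * y_direct[t - i] for i in range(1, 4) if t - i >= 0)
    y_direct.append(val)

# State-space evaluation: x(t) = A0 x(t-1) + B0 u(t-1) + W0 e(t-1)
x = [0.0, 0.0, 0.0]
y_state = []
for t in range(n):
    if t > 0:
        x = [sum(A0[i][j] * x[j] for j in range(3))
             + B0[i] * u[t - 1] + W0[i] * e[t - 1] for i in range(3)]
    y_state.append(x[2] + b[0] * u[t] + c[0] * e[t])   # C0' = [0 0 1]

max_diff = max(abs(p - q) for p, q in zip(y_direct, y_state))
print(max_diff)
```

The two output sequences agree to within floating-point round-off, which is the sense in which the state-space model is obtained from the ARMAX coefficients "by inspection".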


It is important to realize that if we assume that $\{e(t)\}$ is Gaussian, then the ARMAX
model is equivalent to the innovations representation of the previous section, that is,

$$x(t) = Ax(t-1) + Bu(t-1) + We(t-1)$$

$$y(t) = C'x(t) + b_0 u(t) + c_0 e(t) \tag{4.121}$$

where in this case, $K \to W$, $D \to b_0$, and $1 \to c_0$. Also, the corresponding covariance
matrix becomes:

$$R_{ee}^* := \text{cov}\left(\begin{bmatrix} We(t) \\ c_0 e(t) \end{bmatrix}\right) = \begin{bmatrix} WR_{ee}W' & WR_{ee}c_0 \\ c_0 R_{ee}W' & c_0 R_{ee} c_0 \end{bmatrix}$$

This completes the discussion on the equivalence of the general ARMAX to state-space.

Next let us develop the state-space equivalent model for some special cases of
the ARMAX model presented in the previous subsection. We begin with the moving
average (MA)

$$y(t) = \sum_{i=0}^{N_c} c_i e(t-i) \quad \text{or} \quad Y(z) = C(z)E(z) = (1 + c_1 z^{-1} + \cdots + c_{N_c} z^{-N_c})E(z)$$

Define the state variable

$$x_i(t-1) := e(t-i-1), \quad i = 1, \ldots, N_c \tag{4.122}$$

and therefore,

$$x_i(t) = e(t-i) = x_{i-1}(t-1), \quad i = 1, \ldots, N_c \tag{4.123}$$

Expanding this expression, we obtain

$$\begin{aligned} x_1(t) &= e(t-1) \\ x_2(t) &= e(t-2) = x_1(t-1) \\ x_3(t) &= e(t-3) = x_2(t-1) \\ &\ \ \vdots \\ x_{N_c}(t) &= e(t-N_c) = x_{N_c-1}(t-1) \end{aligned} \tag{4.124}$$
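or in vector-matrix form

$$\begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_{N_c}(t) \end{bmatrix} = \left[\begin{array}{ccc|c} 0 & \cdots & 0 & 0 \\ 1 & \cdots & 0 & 0 \\ \vdots & \ddots & \vdots & \vdots \\ 0 & \cdots & 1 & 0 \end{array}\right] \begin{bmatrix} x_1(t-1) \\ x_2(t-1) \\ \vdots \\ x_{N_c}(t-1) \end{bmatrix} + \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} e(t-1)$$

$$y(t) = [c_1 \ c_2 \ \cdots \ c_{N_c}] \begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_{N_c}(t) \end{bmatrix} + c_0 e(t)$$

Thus the general form for the moving average (MA) state-space is given by

$$x(t) = \left[\begin{array}{c|c} 0 & 0 \\ \hline I_{N_c - 1} & 0 \end{array}\right] x(t-1) + b_1 e(t-1)$$

$$y(t) = c'x(t) + c_0 e(t)$$

with $N_x = N_c$, $b_1, c \in \mathcal{R}^{N_x \times 1}$, $b_1$ a unit vector.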


Example 4.8 

Suppose we have the following MA model

$$y(t) = c_0 e(t) + c_1 e(t-1) + c_2 e(t-2)$$

and we would like to construct the equivalent state-space representation. Then we
have $N_x = 2$, and therefore,

$$x(t) = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix} x(t-1) + \begin{bmatrix} 1 \\ 0 \end{bmatrix} e(t-1)$$

$$y(t) = [c_1 \ c_2]\, x(t) + c_0 e(t)$$
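A quick numerical check of Example 4.8 compares the state-space form with direct evaluation of the MA difference equation. The coefficient values and the seed are assumptions made for this sketch, and the noise is taken as zero before $t = 0$.

```python
import random

c0, c1, c2 = 1.0, 0.5, 0.25   # assumed numerical MA coefficients

random.seed(3)  # assumed seed for repeatability
n = 30
e = [random.gauss(0.0, 1.0) for _ in range(n)]

# Direct MA evaluation with e(t) = 0 for t < 0
y_direct = [c0 * e[t]
            + (c1 * e[t - 1] if t >= 1 else 0.0)
            + (c2 * e[t - 2] if t >= 2 else 0.0)
            for t in range(n)]

# State-space form: x(t) = [[0, 0], [1, 0]] x(t-1) + [1, 0]' e(t-1)
x1, x2 = 0.0, 0.0
y_state = []
for t in range(n):
    if t > 0:
        x1, x2 = e[t - 1], x1      # x1(t) = e(t-1), x2(t) = x1(t-1)
    y_state.append(c1 * x1 + c2 * x2 + c0 * e[t])

max_diff = max(abs(p - q) for p, q in zip(y_direct, y_state))
print(max_diff)
```

The states simply store delayed noise samples, so the state transition matrix acts as a shift register and the outputs of the two forms coincide up to round-off.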


Consider the AR model (all-pole) given by 

N a 

a / y (t ~ 0 = ae(t '> or = 707 


(4.127) 
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Here the state vector is defined by x,(f — 1 )=y(t — i — 1) and therefore, 
Xi(t) = y{t - i) = x i+ 1 (t - 1); i = 1, ..., N a — 1 with XN a (t) = y(t). Expanding over 
i, we obtain the vector-matrix state-space model 


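$$
\begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_{N_a}(t) \end{bmatrix}
=
\begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
-a_{N_a} & -a_{N_a-1} & -a_{N_a-2} & \cdots & -a_1
\end{bmatrix}
\begin{bmatrix} x_1(t-1) \\ x_2(t-1) \\ \vdots \\ x_{N_a}(t-1) \end{bmatrix}
+
\begin{bmatrix} 0 \\ 0 \\ \vdots \\ \sigma \end{bmatrix} e(t-1)
$$

$$
y(t) = \begin{bmatrix} 0 & 0 & \cdots & 1 \end{bmatrix}
\begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_{N_a}(t) \end{bmatrix} \tag{4.128}
$$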


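In general, we have the AR (all-pole) state-space model

$$
x(t) = \left[\begin{array}{c|c}
0 & I_{N_a - 1} \\ \hline
-a_{N_a} & -a_{N_a-1} \;\; -a_{N_a-2} \;\cdots\; -a_1
\end{array}\right] x(t-1) + b\, e(t-1)
$$

$$
y(t) = c' x(t) \tag{4.129}
$$

with $N_x = N_a$, $b, c \in R^{N_x \times 1}$. Consider the following example to demonstrate this form.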


Example 4.9 

Given the AR model with $N_a = 2$ and $\sigma = \sqrt{2}$, find the equivalent state-space form.
We have

$$
y(t) = -a_1 y(t-1) - a_2 y(t-2) + \sqrt{2}\, e(t)
$$

and therefore,

$$
\begin{aligned}
x_1(t-1) &= y(t-2) \\
x_2(t-1) &= y(t-1)
\end{aligned}
$$

which gives

$$
\begin{aligned}
x_1(t) &= y(t-1) = x_2(t-1) \\
x_2(t) &= y(t) = -a_1 y(t-1) - a_2 y(t-2) = -a_1 x_2(t-1) - a_2 x_1(t-1)
\end{aligned}
$$

or more succinctly

$$
x(t) = \begin{bmatrix} 0 & 1 \\ -a_2 & -a_1 \end{bmatrix} x(t-1) + \begin{bmatrix} 0 \\ \sqrt{2} \end{bmatrix} e(t-1)
$$

$$
y(t) = \begin{bmatrix} 0 & 1 \end{bmatrix} x(t)
$$

△△△

Another useful state-space representation is the normal form which evolves by
performing a partial fraction expansion of a rational discrete transfer function model
(ARMA) to obtain

$$
H(z^{-1}) = \frac{Y(z^{-1})}{E(z^{-1})} = \sum_{i=1}^{N_p} \frac{R_i}{1 - p_i z^{-1}} \tag{4.130}
$$

for $\{R_i, p_i\}$; $i = 1, \ldots, N_p$ the set of residues and poles of $H(z^{-1})$. Note that the normal
form model is the decoupled or parallel system representation based on the following
set of relations

$$
y_i(t) - p_i y_i(t-1) = e(t), \quad i = 1, \ldots, N_p
$$

Defining the state variable as $x_i(t) := y_i(t)$, then equivalently

$$
x_i(t) - p_i x_i(t-1) = e(t), \quad i = 1, \ldots, N_p \tag{4.131}
$$

and therefore, the output is given by

$$
y(t) = \sum_{i=1}^{N_p} R_i y_i(t) = \sum_{i=1}^{N_p} R_i x_i(t) \tag{4.132}
$$


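Expanding these relations over $i$, we obtain

$$
\begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_{N_p}(t) \end{bmatrix}
=
\begin{bmatrix}
p_1 & 0 & \cdots & 0 \\
0 & p_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{N_p}
\end{bmatrix}
\begin{bmatrix} x_1(t-1) \\ x_2(t-1) \\ \vdots \\ x_{N_p}(t-1) \end{bmatrix}
+
\begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix} e(t-1)
$$

$$
y(t) = \begin{bmatrix} R_1 & R_2 & \cdots & R_{N_p} \end{bmatrix}
\begin{bmatrix} x_1(t) \\ x_2(t) \\ \vdots \\ x_{N_p}(t) \end{bmatrix} \tag{4.133}
$$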
Thus, the general decoupled form of the normal state-space model is given by

$$
x(t) = \begin{bmatrix}
p_1 & 0 & \cdots & 0 \\
0 & p_2 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & p_{N_p}
\end{bmatrix} x(t-1) + b\, e(t-1)
$$

$$
y(t) = c' x(t) \tag{4.134}
$$

for $b \in R^{N_p \times 1}$ with $b = \mathbf{1}$, an $N_p$-vector of unit elements. Here $c \in R^{1 \times N_p}$ and
$c' = [R_1 \; R_2 \; \cdots \; R_{N_p}]$.

Example 4.10 

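Consider the following set of parameters and model with $N_x = N_p = 3$ and

$$
y_i(t) = p_i y_i(t-1) + e(t-1)
$$

$$
y(t) = \sum_{i=1}^{3} R_i y_i(t)
$$

Using the normal state-space form structure above, we obtain by inspection

$$
x(t) = \begin{bmatrix} p_1 & 0 & 0 \\ 0 & p_2 & 0 \\ 0 & 0 & p_3 \end{bmatrix} x(t-1) + \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} e(t-1)
$$

$$
y(t) = \begin{bmatrix} R_1 & R_2 & R_3 \end{bmatrix} x(t)
$$

△△△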
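
One way to sanity-check the normal form is through its impulse response. The sketch below is illustrative only: the function names are hypothetical, and it assumes zero initial state and a unit impulse e(0) = 1. It runs the decoupled recursion x(t) = diag(p) x(t-1) + 1 e(t-1), y(t) = c'x(t), and compares the result with the sum of first-order modes, which under these assumptions is the sum over i of R_i p_i^(t-1) for t >= 1.

```python
def normal_form_impulse_response(poles, residues, n):
    """Impulse response of x(t) = diag(p) x(t-1) + 1 e(t-1), y(t) = c'x(t), with x(0) = 0."""
    x = [0.0] * len(poles)
    h = []
    for t in range(n):
        e_prev = 1.0 if t == 1 else 0.0                    # e(t-1) for a unit impulse at t = 0
        x = [p * xi + e_prev for p, xi in zip(poles, x)]   # each mode evolves independently
        h.append(sum(r * xi for r, xi in zip(residues, x)))
    return h


def modal_impulse_response(poles, residues, n):
    """Closed form under the same assumptions: h(0) = 0, h(t) = sum_i R_i p_i**(t-1) for t >= 1."""
    h = [0.0]
    for t in range(1, n):
        h.append(sum(r * p ** (t - 1) for p, r in zip(poles, residues)))
    return h
```

Because the state matrix is diagonal, each state is a scalar first-order recursion, which is why the loop needs no matrix arithmetic.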


4.8 NONLINEAR (APPROXIMATE) GAUSS-MARKOV 
STATE-SPACE MODELS 

Many processes in practice are nonlinear rather than linear. Coupling the
nonlinearities with noisy data makes the signal processing problem a challenging one. In
this section we develop an approximate solution to the nonlinear modeling problem
involving the linearization of the nonlinear process about a “known” reference
trajectory. We limit our discussion to discrete nonlinear systems. Continuous solutions
to this problem are developed in [6-13].

Suppose we model a process by a set of nonlinear stochastic vector difference
equations in state-space form as

$$
x(t) = a[x(t-1)] + b[u(t-1)] + w(t-1) \tag{4.135}
$$

with the corresponding measurement model

$$
y(t) = c[x(t)] + v(t) \tag{4.136}
$$
FIGURE 4.5 Linearization of a deterministic system using the reference trajectory defined
by $(x^*(t), u^*(t))$.


where $a[\cdot]$, $b[\cdot]$, $c[\cdot]$ are nonlinear vector functions of $x$, $u$, with $x, a, b, w \in R^{N_x \times 1}$,
$y, c, v \in R^{N_y \times 1}$ and $w \sim \mathcal{N}(0, R_{ww}(t-1))$, $v \sim \mathcal{N}(0, R_{vv}(t))$.

Ignoring the additive noise sources, we “linearize” the process and measurement
models about a known deterministic reference trajectory defined by $[x^*(t), u^*(t)]$ as
illustrated in Fig. 4.5¹⁰, that is,

$$
x^*(t) = a[x^*(t-1)] + b[u^*(t-1)] \tag{4.137}
$$


Deviations or perturbations from this trajectory are defined by:

$$
\begin{aligned}
\delta x(t) &:= x(t) - x^*(t) \\
\delta u(t) &:= u(t) - u^*(t)
\end{aligned}
$$

Substituting the previous equations into these expressions, we obtain the perturbation
trajectory as

$$
\delta x(t) = a[x(t-1)] - a[x^*(t-1)] + b[u(t-1)] - b[u^*(t-1)] + w(t-1) \tag{4.138}
$$


10 In practice, the reference trajectory is obtained either by developing a mathematical model of the process 
or by simulating about some reasonable operating conditions to generate the trajectory using the state-space 
The nonlinear vector functions $a[\cdot]$ and $b[\cdot]$ can be expanded into a first order
Taylor series about the reference trajectory $[x^*(t), u^*(t)]$ as¹¹

$$
\begin{aligned}
a[x(t-1)] &= a[x^*(t-1)] + \frac{da[x^*(t-1)]}{dx^*(t-1)}\,\delta x(t-1) + \text{H.O.T.} \\
b[u(t-1)] &= b[u^*(t-1)] + \frac{db[u^*(t-1)]}{du^*(t-1)}\,\delta u(t-1) + \text{H.O.T.}
\end{aligned} \tag{4.139}
$$

We define the first order Jacobian matrices as

$$
A[x^*(t-1)] := \frac{da[x^*(t-1)]}{dx^*(t-1)} \quad \text{and} \quad B[u^*(t-1)] := \frac{db[u^*(t-1)]}{du^*(t-1)} \tag{4.140}
$$

Incorporating the definitions of Eq. 4.140 and neglecting the higher order terms
(H.O.T.) in Eq. 4.139, the linearized process model in Eq. 4.138 can be expressed as

$$
\delta x(t) = A[x^*(t-1)]\,\delta x(t-1) + B[u^*(t-1)]\,\delta u(t-1) + w(t-1) \tag{4.141}
$$


Similarly, the measurement system can be linearized by using the reference
measurement

$$
y^*(t) = c[x^*(t)] \tag{4.142}
$$

and applying the Taylor series expansion to the nonlinear measurement model

$$
c[x(t)] = c[x^*(t)] + \frac{dc[x^*(t)]}{dx^*(t)}\,\delta x(t) + \text{H.O.T.} \tag{4.143}
$$

The corresponding measurement perturbation is defined by

$$
\delta y(t) := y(t) - y^*(t) = c[x(t)] - c[x^*(t)] + v(t) \tag{4.144}
$$

Substituting the first order approximation for $c[x(t)]$ leads to the linearized
measurement perturbation as

$$
\delta y(t) = C[x^*(t)]\,\delta x(t) + v(t) \tag{4.145}
$$

where $C[x^*(t)]$ is defined as the measurement Jacobian as before.

Summarizing, we have linearized a deterministic nonlinear model using a first-
order Taylor series expansion for the functions, a, b, and c and then developed a
linearized Gauss-Markov perturbation model valid for small deviations given by

$$
\begin{aligned}
\delta x(t) &= A[x^*(t-1)]\,\delta x(t-1) + B[u^*(t-1)]\,\delta u(t-1) + w(t-1) \\
\delta y(t) &= C[x^*(t)]\,\delta x(t) + v(t)
\end{aligned} \tag{4.146}
$$

with A, B and C the corresponding Jacobian matrices and w, v zero-mean, Gaussian.
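
To see what "valid for small deviations" means in practice, the sketch below compares the true deviation from a reference trajectory with the deviation predicted by the linearized perturbation model. Everything in it is an illustrative assumption rather than part of the text: the scalar functions a(x) = 0.8x + 0.1x^2 and b(u) = 0.5u are hypothetical, the noise is set to zero, and the function names are invented for the example.

```python
def a(x):
    """Hypothetical scalar process nonlinearity (assumed for illustration)."""
    return 0.8 * x + 0.1 * x * x


def a_jacobian(x):
    """Jacobian A[x] = da/dx of the hypothetical process function."""
    return 0.8 + 0.2 * x


def b(u):
    """Hypothetical input function; it is linear, so its Jacobian is constant."""
    return 0.5 * u


B_JACOBIAN = 0.5


def linearization_error(x_ref0, u_ref, dx0, du, steps):
    """Largest gap between the true deviation x(t) - x*(t) and the linearized delta-x(t)."""
    x_ref = x_ref0              # reference trajectory x*(t)
    x_true = x_ref0 + dx0       # perturbed nonlinear trajectory x(t)
    dx_lin = dx0                # linearized perturbation delta-x(t)
    worst = 0.0
    for _ in range(steps):
        # Jacobian is evaluated on the reference at time t-1, before x_ref is advanced.
        dx_lin = a_jacobian(x_ref) * dx_lin + B_JACOBIAN * du
        x_true = a(x_true) + b(u_ref + du)
        x_ref = a(x_ref) + b(u_ref)
        worst = max(worst, abs((x_true - x_ref) - dx_lin))
    return worst
```

Because the neglected terms are second order, shrinking the perturbations by a factor of ten should shrink the error by roughly a factor of one hundred for this quadratic example.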


11 We use the shorthand i 
We can also use linearization techniques to approximate the statistics of the process
and measurements. If we use the first-order Taylor series expansion and expand about
the mean, $m_x(t)$, rather than $x^*(t)$, then taking expected values

$$
m_x(t) = E\{a[x(t-1)]\} + E\{b[u(t-1)]\} + E\{w(t-1)\} \tag{4.147}
$$

gives

$$
m_x(t) = a[m_x(t-1)] + b[u(t-1)] \tag{4.148}
$$

which follows by linearizing $a[\cdot]$ about $m_x$ and taking the expected value to obtain

$$
\begin{aligned}
E\{a[x(t-1)]\} &= E\left\{a[m_x(t-1)] + \frac{da[m_x(t-1)]}{dm_x(t-1)}\,[x(t-1) - m_x(t-1)]\right\} \\
&= a[m_x(t-1)] + \frac{da[m_x(t-1)]}{dm_x(t-1)}\,\big[E\{x(t-1)\} - m_x(t-1)\big] \\
&= a[m_x(t-1)]
\end{aligned}
$$

The variance equations $P(t) := \operatorname{cov}(x(t))$ can also be developed in a similar manner
(see [2] for details) to give

$$
P(t) = A[m_x(t-1)]\,P(t-1)\,A'[m_x(t-1)] + R_{ww}(t-1) \tag{4.149}
$$

Using the same approach, we arrive at the accompanying measurement statistics

$$
m_y(t) = c[m_x(t)] \quad \text{and} \quad R_{yy}(t) = C[m_x(t)]\,P(t)\,C'[m_x(t)] + R_{vv}(t) \tag{4.150}
$$

We summarize these results in an “approximate” Gauss-Markov model of Table 4.3. 
Before we close consider the following example to illustrate the approximation. 

Example 4.11 

Consider the discrete nonlinear process given by

$$
x(t) = (1 - 0.05\Delta T)\,x(t-1) + 0.04\Delta T\,x^2(t-1) + w(t-1)
$$

with corresponding measurement model

$$
y(t) = x^2(t) + x^3(t) + v(t)
$$

where $w(t) \sim \mathcal{N}(0, R_{ww})$, $v(t) \sim \mathcal{N}(0, R_{vv})$ and $x(0) \sim \mathcal{N}(\bar{x}(0), P(0))$. Performing the
differentiations we obtain the following Jacobians:

$$
A[x(t-1)] = 1 - 0.05\Delta T + 0.08\Delta T\,x(t-1) \quad \text{and} \quad C[x(t)] = 2x(t) + 3x^2(t)
$$

△△△
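
The Jacobians of Example 4.11 and the approximate mean and covariance propagation of Eqs. 4.148-4.150 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the sampling interval DT = 0.01 is assumed (the example leaves it symbolic), there is no input term, and all function names are hypothetical.

```python
DT = 0.01  # assumed sampling interval; the example itself keeps it symbolic


def a(x):
    """Process function of the example: (1 - 0.05 DT) x + 0.04 DT x^2."""
    return (1.0 - 0.05 * DT) * x + 0.04 * DT * x * x


def c(x):
    """Measurement function of the example: x^2 + x^3."""
    return x ** 2 + x ** 3


def a_jacobian(x):
    """Process Jacobian: 1 - 0.05 DT + 0.08 DT x."""
    return 1.0 - 0.05 * DT + 0.08 * DT * x


def c_jacobian(x):
    """Measurement Jacobian: 2x + 3x^2."""
    return 2.0 * x + 3.0 * x * x


def central_difference(f, x, h=1e-6):
    """Numerical derivative, used only to cross-check the analytic Jacobians."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


def propagate_statistics(m0, p0, r_ww, r_vv, steps):
    """Approximate mean/covariance propagation; returns (m_x, P, m_y, R_yy) per step."""
    m, p = m0, p0
    history = []
    for _ in range(steps):
        jac_a = a_jacobian(m)          # evaluated at m_x(t-1), before the mean is advanced
        p = jac_a * p * jac_a + r_ww   # state covariance propagation
        m = a(m)                       # state mean propagation
        jac_c = c_jacobian(m)          # evaluated at m_x(t)
        history.append((m, p, c(m), jac_c * p * jac_c + r_vv))
    return history
```

For instance, `propagate_statistics(2.0, 0.01, 0.0, 0.09, 100)` would trace the approximate statistics for 100 steps; the initial mean, initial variance and noise variances in that call are placeholder values chosen for illustration.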
TABLE 4.3 Approximate Nonlinear Gauss-Markov Model

State Propagation
$x(t) = a[x(t-1)] + b[u(t-1)] + w(t-1)$

State Mean Propagation
$m_x(t) = a[m_x(t-1)] + b[u(t-1)]$

State Covariance Propagation
$P(t) = A[m_x(t-1)]\,P(t-1)\,A'[m_x(t-1)] + R_{ww}(t-1)$

Measurement Propagation
$y(t) = c[x(t)] + v(t)$

Measurement Mean Propagation
$m_y(t) = c[m_x(t)]$

Measurement Covariance Propagation
$R_{yy}(t) = C[m_x(t)]\,P(t)\,C'[m_x(t)] + R_{vv}(t)$

Initial Conditions
$x(0)$ and $P(0)$

Jacobians
$A[x^*(t-1)] \equiv \left.\dfrac{da[x(t-1)]}{dx(t-1)}\right|_{x = x^*(t-1)}$ and $C[x^*(t)] \equiv \left.\dfrac{dc[x(t)]}{dx(t)}\right|_{x = x^*(t)}$

Although the linearization approach discussed here seems somewhat extraneous
relative to the previous sections, it becomes a crucial ingredient in the classical
approach to (approximate) nonlinear estimation of the subsequent chapters. We
discuss the linear state-space approach (Kalman filter) to the estimation problem in great
detail in the next chapter and then show how these linearization concepts can be used
to solve the nonlinear estimation problem in the following chapter. There the popular
“extended” Kalman filter processor relies heavily on the linearization techniques
developed in this section.

4.9 SUMMARY 

In this chapter we have discussed the development of continuous-time, sampled-data
and discrete-time state-space models. The stochastic variants of these three types of
models were presented leading to the Gauss-Markov representations for both linear
and (approximate) nonlinear systems. The discussion of both the deterministic and
stochastic state-space models included a brief development of their second order
statistics. We also discussed the underlying discrete systems theory as well as a
variety of time series models (ARMAX, AR, MA, etc.) and showed that they can easily be
represented in state-space form through the use of canonical forms (models). These
models form the embedded structure incorporated into the majority of the Bayesian
processors that will be discussed in subsequent chapters.


MATLAB NOTES 

MATLAB has many commands to convert to/from state-space models to other
forms useful in signal processing. Many of them reside in the Signal Processing
and Control Systems toolboxes. The matrix exponential invoked by the
expm command is determined from Taylor/Padé approximants using the scaling
and squaring approach of Section 4.2. Also the commands expmdemo1,
expmdemo2, and expmdemo3 demonstrate the trade-offs of the Padé, Taylor
and eigenvector approaches to calculating the matrix exponential. The ordinary
differential equation method is available using the wide variety of numerical
integrators available (ode*). Converting to/from transfer functions and state-space is
accomplished using the ss2tf and tf2ss commands, respectively. ARMAX
simulations are easily accomplished using the filter command with a variety of
options converting from ARMAX to/from transfer functions. The Identification
Toolbox converts polynomial-based models to state-space and continuous parameters
including Gauss-Markov to discrete parameters (th2ss, thc2thd, thd2thc). The
third-party toolbox SSPACK_PC converts continuous-time models to discrete
(SSCTOD), performs Gauss-Markov (linear and nonlinear) simulations as
well as innovations-based simulations (SSISIM), and provides conversions from GM to
innovations models (INVTOGM, GMTOINV). See http://www.techni-soft.net
for details.
PROBLEMS

4.1 Suppose the stochastic process {y(t)} is generated by

$y(t) = a \exp(-t) + bt$, a, b random, then

(a) What is the mean of the process?

(b) What is the corresponding covariance?

(c) Is the process stationary, if $E\{a\} = E\{b\} = 0$, and $E\{ab\} = 0$?

4.2 Suppose x, y, z are Gaussian random variables with corresponding means $m_x$,
$m_y$, $m_z$ and variances $R_{xx}$, $R_{yy}$, $R_{zz}$. Show that:

(a) If $y = ax + b$, a, b constants, then $y \sim N(a m_x + b, a^2 R_{xx})$.

(b) If x and y are uncorrelated, then they are independent.

(c) If $x(i)$ are Gaussian with mean $m(i)$ and variance $R_{xx}(i)$, then for

$y = \sum_i K_i x(i)$, $\quad y \sim N\left(\sum_i K_i m(i), \sum_i K_i^2 R_{xx}(i)\right)$

(d) If x and y are jointly (conditionally) Gaussian, then

$E\{x|y\} = m_x + R_{xy} R_{yy}^{-1}(y - m_y)$, and
$R_{x|y} = R_{xx} - R_{xy} R_{yy}^{-1} R_{yx}$

(e) The random variable $\tilde{x} = x - E\{x|y\}$ is orthogonal to y.

(f) If y and z are independent, then

$E\{x|y,z\} = E\{x|y\} + E\{x|z\} - m_x$

(g) If y and z are not independent, show that

$E\{x|y,z\} = E\{x|y,e\} = E\{x|y\} + E\{x|e\} - m_x$
for $e = z - E\{z|y\}$.

4.3 Assume y(t) is a zero mean, ergodic process with covariance $R_{yy}(k)$, calculate
the corresponding power spectra, $S_{yy}(z)$ if

(a) $R_{yy}(k) = C a^{|k|}$.

(b) Ryy(k)=C cos(w\k\), \k\ < f. 

(c) $R_{yy}(k) = C \exp(-a^{|k|})$.

(Hint: Recall the sum decomposition: $S_{yy}(z) = S^+_{yy}(z) + S^-_{yy}(z) - R_{yy}(0)$ with
$S^+_{yy}$ the one-sided Z-transform and $S^-_{yy}(z) = S^+_{yy}(z^{-1})$.)

4.4 Develop a MATLAB program to simulate the ARMA process

$y(t) = -a y(t-1) + e(t)$

where $a = 0.75$, $e \sim N(0, 0.1)$ for 100 data points.

(a) Calculate the analytic covariance $R_{yy}(k)$.

(b) Determine an expression to “recursively” calculate $R_{yy}(k)$.

(c) Plot the simulated results and construct the $\pm 2\sqrt{R_{yy}(0)}$ bounds.

(d) Do 95% of the samples fall within these bounds?

4.5 Develop the digital filter to simulate a sequence, y(t), with covariance
$R_{yy}(k) = 4e^{-3|k|}$. Perform the simulation using MATLAB.

(Hint: Recall the spectral factorization (Wiener): $S_{yy}(z) = H(z) \times H(z^{-1})$
where the poles and zeros of H(z) lie inside the unit circle.)
Using the “realized” digital filter perform a simulation as in the previous
example and check the validity of the samples lying within bounds.

4.6 Suppose we are given a zero-mean process with covariance

$R_{yy}(k) = 10 \exp(-0.5|k|)$

(a) Determine the digital filter which when driven by white noise will yield a
sequence with the above covariance.

(b) Develop a computer program to generate y(t) for 100 points.

(c) Plot the results and determine if 95% of the samples fall within
$\pm 2\sqrt{R_{yy}(0)}$.

4.7 Suppose we are given the factored power spectrum $S_{yy}(z) = H(z)H(z^{-1})$ with


H(z) = 


•Mir 1 filit 

1 + a\z~ l + 0L2 z~ 2 


(a) Develop the ARMAX model for the process.

(b) Develop the corresponding Gauss-Markov model for both the standard
and innovations representation of the process.


4.8 Suppose we are given a causal LTI system characterized by its impulse
response, h(t). If this system is excited by zero-mean, unit variance white
noise, then

(a) Determine the output variance, $R_{yy}(0)$;

(b) Determine the covariance, $R_{yy}(k)$ for $k > 0$;

(c) Suppose the system transfer function is given by

$$
H(z) = \frac{1 + b_0 z^{-1}}{1 + a_1 z^{-1} + a_2 z^{-2}}
$$

find a method to recursively calculate h(t) and therefore $R_{yy}(0)$.

4.9 Given the covariance function

$R_{yy}(k) = e^{-\frac{1}{2}|k|} \cos \pi|k|$,

find the digital filter that, when driven by unit variance white noise, produces
the sequence {y(t)} with these statistics.

4.10 Suppose we have a process characterized by the difference equation

$y(t) = x(t) + \tfrac{1}{2} x(t-1) + \tfrac{1}{3} x(t-2)$

(a) Determine a recursion for the output covariance, $R_{yy}(k)$.

(b) If x(t) is white with variance $\sigma_x^2$, determine $R_{yy}(k)$.

(c) Determine the output PSD, $S_{yy}(z)$.
4.11 We are given a linear system characterized by the difference equation

$y(t) - \tfrac{1}{5} y(t-1) = \tfrac{1}{\sqrt{3}}\, x(t)$

and the system is excited by:

(1) white Gaussian noise, $x \sim N(0, 3)$;

(2) exponentially correlated noise, $R_{ee}(k) = (1/2)^{|k|}$

In both cases find:

(a) Output PSD, $S_{yy}(z)$;

(b) Output covariance, $R_{yy}(k)$;

(c) Cross-spectrum, $S_{ye}(z)$;

(d) Cross-covariance, $R_{ye}(k)$.

4.12 We are given the following Gauss-Markov model

$x(t) = \tfrac{1}{3} x(t-1) + \tfrac{1}{2} w(t-1)$
$y(t) = 5x(t) + v(t)$

$w \sim N(0, 3)$, $v \sim N(0, 2)$

(a) Calculate the state power spectrum, $S_{xx}(z)$.

(b) Calculate the measurement power spectrum, $S_{yy}(z)$.

(c) Calculate the state covariance recursion, $P(t)$.

(d) Calculate the steady-state covariance, $P(t) = \cdots = P = P_{ss}$.

(e) Calculate the output covariance recursion, $R_{yy}(t)$.

(f) Calculate the steady-state output covariance, $R_{yy}$.

4.13 Suppose we are given the Gauss-Markov process characterized by the state
equations

$x(t) = 0.91x(t-1) + u(t-1) + w(t-1)$

for u(t) a step of amplitude 0.03 and $w \sim N(0, 10^{-4})$ and $x(0) \sim N(2.5, 10^{-12})$.

(a) Calculate the covariance of x, i.e., $P(t) = \operatorname{Cov}(x(t))$.

(b) Since the process is stationary, we know that

$P(t+k) = P(t+k-1) = \cdots = P(0) = P$

What is the steady state covariance, P, of this process?

(c) Develop a MATLAB program to simulate this process.

(d) Plot the process x(t) with the corresponding confidence limits $\pm 2\sqrt{P(t)}$
for 100 data points. Do 95% of the samples lie within the bounds?
4.14 Suppose we are given the ARMAX model

$y(t) = -0.5y(t-1) - 0.7y(t-2) + u(t) + 0.3u(t-1) + e(t) + 0.2e(t-1) + 0.4e(t-2)$

(a) What is the corresponding innovations model in state-space form for
$e \sim N(0, 10)$?

(b) Calculate the corresponding covariance matrix $R^*_{ee}$.

4.15 Given the following ARMAX model

$$
A(q^{-1})y(t) = B(q^{-1})u(t) + \frac{C(q^{-1})}{D(q^{-1})}\, e(t)
$$

for $q^{-1}$ the backward shift (delay) operator such that

$$
\begin{aligned}
A(q^{-1}) &= 1 + 1.5q^{-1} + 0.7q^{-2} \\
B(q^{-1}) &= 1 + 0.5q^{-1} \\
C(q^{-1}) &= 1 + 0.7q^{-1} \\
D(q^{-1}) &= 1 + 0.5q^{-1}
\end{aligned}
$$

(a) Find the pulse transfer representation of this process (C = D = 0). Convert
it to the following equivalent pole-zero and normal state-space forms. Is
the system controllable? Is it observable? Show your calculations.

(b) Find the pole-zero or ARX representation of this process (C = 1, D = 0).
Convert it to the equivalent state-space form.

(c) Find the pole-zero or ARMAX representation of this process (D = 0).
Convert it to the equivalent state-space form.

(d) Find the all-zero or FIR representation of this process (A = 1, C = D = 0).
Convert it to the equivalent state-space form.

(e) Find the all-pole or IIR representation of this process (B = 0, C = 0, D = 0).
Convert it to the equivalent state-space form.

(f) Find the all-zero or MA representation of this process (A = 1, B = 0, D = 0).
Convert it to the equivalent state-space form.

(g) Using the full model above with A, B, C, D polynomials, is it possible to
find an equivalent Gauss-Markov representation? If so, find it and convert
to the equivalent state-space form. (Hint: Consider the C/D polynomials 
to be a coloring filter with input e(t) and output e(t).) 

4.16 Given a continuous-discrete Gauss-Markov model

$\dot{x}_t = \alpha x_t + u_t + w_t$
$y(t_k) = \beta x(t_k) + v(t_k)$

where $w_t$ and $v(t_k)$ are zero-mean and white with respective covariances, $R_{ww}$
and $R_{vv}$, along with a piecewise constant input, $u_t$.

(a) Develop the continuous-discrete mean and covariance propagation models 
for this system. 

(b) Suppose w(t) is processed by a coloring filter that exponentially correlates 
it, R ww (t) = Ge~^ z . Develop the continuous-discrete Gauss-Markov 
model in this case. 

4.17 Develop the continuous-discrete Gauss-Markov models for the following 
systems: 

(a) Wiener process: $\dot{z}_t = w_t$; $z_0 = 0$, w is zero-mean, white with $R_{ww}$.

(b) Random bias: $\dot{z}_t = 0$; $z_0 = z_o$ where $z_o \sim \mathcal{N}(0, R_{z_o z_o})$.

(c) Random ramp: $\ddot{z}_t = 0$; $\dot{z}_0 = z_1$; $z_0 = z_o$

(d) Random oscillation: $\ddot{z}_t + \omega^2 z_t = 0$; $\dot{z}_0 = z_1$; $z_0 = z_o$

(e) Random second order: $\ddot{z}_t + 2\zeta\omega_n \dot{z}_t + \omega_n^2 z_t = \omega_n^2 w_t$; $\dot{z}_0 = z_1$; $z_0 = z_o$

4.18 Develop the continuous-discrete Gauss-Markov model for correlated process
noise, that is,

$\dot{w}_t = A_{cw} w_t + B_{cw} u_t + W_{cw} w^*_t$ for $w^* \sim \mathcal{N}(0, R_{w^*w^*})$

4.19 Develop the approximate Gauss-Markov model for the following nonlinear
state transition and measurement model given by

$$
x(t) = \frac{1}{2}x(t-1) + \frac{25x(t-1)}{1 + x^2(t-1)} + 8\cos(1.2(t-1)) + w(t-1)
$$

$$
y(t) = \frac{x^2(t)}{20} + v(t)
$$

where $w \sim \mathcal{N}(0, R_{ww}(t-1))$ and $v \sim \mathcal{N}(0, R_{vv}(t))$. The initial state is Gaussian
distributed with $x(0) \sim \mathcal{N}(0, P(0))$.

4.20 Consider the discrete nonlinear process given by

$x(t) = (1 - 0.05\Delta T)\,x(t-1) + 0.04\Delta T\,x^2(t-1) + w(t-1)$

with corresponding measurement model

$y(t) = x^2(t) + x^3(t) + v(t)$

where $w \sim \mathcal{N}(0, R_{ww}(t-1))$ and $v \sim \mathcal{N}(0, R_{vv}(t))$. The initial state is Gaussian
distributed with $x(0) \sim \mathcal{N}(0, P(0))$.

Develop the approximate Gauss-Markov process model for this nonlinear
system.
CLASSICAL BAYESIAN
STATE-SPACE PROCESSORS 


5.1 INTRODUCTION 

In this chapter we introduce the concepts of statistical signal processing from
the Bayesian perspective using state-space models. We first develop the Bayesian
paradigm using the generic state-space representation of the required conditional
distributions and show how they propagate within the Bayesian framework. Next we
start with the linear (time-varying) Gauss-Markov model and develop the required
conditional distributions leading to the well-known Kalman filter processor [1]. Based
on this fundamental theme, we progress to the idea of linearization of the
nonlinear state-space system developed in the previous chapter, where we derive the
linearized Bayesian processor (LZ-BP). It is shown that the resulting processor
provides a solution (time-varying) to the nonlinear state estimation. We then develop
the extended Bayesian processor (XBP) or equivalently the extended Kalman filter
(EKF), as a special case of the LZ-BP linearizing about the most currently available
estimate. Next we investigate a further enhancement of the XBP by introducing a
local iteration of the nonlinear measurement system. Here the processor is called
the iterated-extended Bayesian processor (IX-BP) and is shown to produce improved
estimates at a small computational cost in most cases. We summarize the results with
a case study implementing a 2D-tracking filter.

5.2 BAYESIAN APPROACH TO THE STATE-SPACE 

In the previous chapter, we briefly developed deterministic and stochastic (Gauss-
Markov) state-space models and demonstrated how the states propagate through
the state transition mechanism for both continuous and discrete systems and their
variants. Here we again take a Bayesian perspective and assume that the state or
dynamic variables evolve according to a “probabilistic” transition mechanism.

Bayesian estimation relative to the state-space models is based on extracting
the unobserved or hidden dynamic (state) variables from noisy measurement data.
The Markovian state vector with initial distribution, $\Pr(x(0))$, propagates
temporally throughout the state-space according to the probabilistic transition distribution,
$\Pr(x(t)|x(t-1))$, while the conditionally independent measurements evolve from the
likelihood distribution, $\Pr(y(t)|x(t))$. We see that the dynamic state variable at time t
is obtained through the transition probability based on the previous state (Markovian
property), $x(t-1)$, and the knowledge of the underlying conditional probability. Once
propagated to time t, the dynamic state variable is updated or corrected based on
the likelihood probability and the new measurement, $y(t)$. Note that it is the knowledge
of these conditional distributions that enables the Bayesian processor.

Returning to the usual model-based constructs of the dynamic state variables
discussed in the previous chapter, we see that there is an implied equivalence between the
probabilistic distributions and the underlying state/measurement transition models.
Recall from Chapter 4 that the functional discrete state representation is given by

$$
\begin{aligned}
x(t) &= A\big(x(t-1), u(t-1), w(t-1)\big) \\
y(t) &= C\big(x(t), u(t), v(t)\big)
\end{aligned} \tag{5.1}
$$

where w and v are the respective process and measurement noise sources with u a
known input. Here $A(\cdot)$ is the nonlinear (or linear) dynamic state transition function
and $C(\cdot)$ the corresponding measurement function. Both conditional probabilistic
distributions embedded within the Bayesian framework are completely specified by
these functions and the underlying noise distributions: $\Pr(w(t-1))$ and $\Pr(v(t))$. That
is, we have the (implied) equivalence¹

$$
\begin{aligned}
A\big(x(t-1), u(t-1), w(t-1)\big) &\Rightarrow \Pr(x(t)|x(t-1)) \Leftrightarrow A(x(t)|x(t-1)) \\
C\big(x(t), u(t), v(t)\big) &\Rightarrow \Pr(y(t)|x(t)) \Leftrightarrow C(y(t)|x(t))
\end{aligned} \tag{5.2}
$$

Thus, the state-space model along with the noise statistics and prior distributions
define the required Bayesian representation or probabilistic propagation model which
defines the evolution of the states and measurements through the transition
probabilities. This is sometimes a subtle point that must be emphasized. As illustrated in
the diagram of Fig. 5.1, the dynamic state variables propagate throughout state-
space specified by the transition probability $A(x(t)|x(t-1))$ using the embedded
process model. That is, the “unobserved” state at time $t-1$ requires the transition
probability distribution to propagate to the state at time t. Once evolved, the state
combines under the corresponding measurement at time t through the conditional
likelihood distribution $C(y(t)|x(t))$ using the embedded measurement model to obtain

1 We use this notation to emphasize the influence of both process (A) and measurement (C) representations 
FIGURE 5.1 Bayesian state-space probabilistic evolution.


the required likelihood distribution. These events continue to evolve throughout with 
the states propagating through the state transition probability using the process model 
and the measurements generated by the states and likelihood using the measurement 
model. From the Bayesian perspective, the broad prior is scaled by the evidence and 
“narrowed” by the likelihood to estimate the posterior. 

With this in mind we can now return to the original Bayesian estimation problem, 
define it and show (at least conceptually) the optimal solution based on the state-space 
representation. 

The basic dynamic state estimation (signal enhancement) problem can now be
stated in the Bayesian framework as:

GIVEN a set of noisy uncertain measurements, $\{y(t)\}$, and known inputs, $\{u(t)\}$,
$t = 0, \ldots, N$ along with the corresponding prior distributions for the initial state and
process and measurement noise sources: $\Pr(x(0))$, $\Pr(w(t-1))$, $\Pr(v(t))$ as well as
the conditional transition and likelihood probability distributions: $\Pr(x(t)|x(t-1))$,
$\Pr(y(t)|x(t))$ characterized by the state and measurement models: $A(x(t)|x(t-1))$,
$C(y(t)|x(t))$, FIND the “best” estimate of the filtering posterior, $\Pr(x(t)|Y_t)$, and its
associated statistics.

It is interesting to note that the entire Bayesian system can be defined by the set
$\Sigma$ as

$$
\Sigma := \big[\{y(t)\}, \{u(t)\}, \Pr(x(0)), \Pr(w(t-1)), \Pr(v(t)), A(x(t)|x(t-1)), C(y(t)|x(t))\big]
$$

Compare this to the model-based solutions to follow where we obtain closed form
analytic expressions for these distributions.

Analytically, to generate the model-based version of the sequential Bayesian
processor, we replace the transition and likelihood distributions with the conditionals
of Eq. 5.2. The solution to the signal enhancement or equivalently state estimation
problem is given by the filtering distribution, $\Pr(x(t)|Y_t)$, which was solved previously
in Sec. 2.5 (see Table 2.1). We start with the prediction recursion characterized by the
Chapman-Kolmogorov equation replacing the transition probability with the implied
model-based conditional, that is,

$$
\Pr(x(t)|Y_{t-1}) = \int \overbrace{A(x(t)|x(t-1))}^{\text{Embedded Process Model}} \times \overbrace{\Pr(x(t-1)|Y_{t-1})}^{\text{Prior}} \, dx(t-1) \tag{5.3}
$$

Next we incorporate the model-based likelihood into the posterior equation with
the understanding that the process model has been incorporated into the prediction

$$
\Pr(x(t)|Y_t) = \overbrace{C(y(t)|x(t))}^{\text{Embedded Measurement Model}} \times \overbrace{\Pr(x(t)|Y_{t-1})}^{\text{Prediction}} \Big/ \Pr(y(t)|Y_{t-1}) \tag{5.4}
$$

So we see from the Bayesian perspective that the sequential Bayesian processor
employing the state-space representation of Eq. 5.1 is straightforward. Next let us
investigate a more detailed development of the processor resulting in a closed-form
solution: the linear Kalman filter.
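
The two-step recursion of Eqs. 5.3 and 5.4 can be illustrated with a brute-force numerical sketch for a scalar state. This is only an illustration under stated assumptions, not the closed-form processor developed next: the state is discretized on a uniform grid, the integral is replaced by a Riemann sum, and every function name here is hypothetical. The transition and likelihood densities are passed in as callables so that any scalar model can be plugged in.

```python
import math


def gaussian_pdf(z, mean, var):
    """Scalar Gaussian density N(z : mean, var)."""
    return math.exp(-0.5 * (z - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def make_grid(lo, hi, n):
    """Uniform grid of n points on [lo, hi] and its spacing."""
    dx = (hi - lo) / (n - 1)
    return [lo + i * dx for i in range(n)], dx


def predict(grid, dx, filtered, transition_pdf):
    """Prediction step: integrate transition_pdf(x_t, x_prev) * Pr(x_prev | Y_{t-1}) over x_prev."""
    return [
        sum(transition_pdf(x_t, x_prev) * p_prev for x_prev, p_prev in zip(grid, filtered)) * dx
        for x_t in grid
    ]


def update(grid, dx, predicted, likelihood_pdf, y):
    """Update step: likelihood times prediction, divided by the evidence Pr(y(t) | Y_{t-1})."""
    unnormalized = [likelihood_pdf(y, x) * p for x, p in zip(grid, predicted)]
    evidence = sum(unnormalized) * dx
    return [u / evidence for u in unnormalized]
```

As a usage example with assumed values, a transition density `lambda xt, xp: gaussian_pdf(xt, 0.9 * xp, 0.5)` and a likelihood `lambda y, x: gaussian_pdf(y, x, 1.0)` define a linear Gaussian model; alternating `predict` and `update` over the measurements then yields the filtering posterior on the grid. The cost grows with the square of the grid size, which is one reason closed-form solutions are attractive when they exist.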


5.3 LINEAR BAYESIAN PROCESSOR (LINEAR KALMAN FILTER) 

In this section we constrain the state-space model to be linear (time-varying) and apply
the Bayesian approach to obtain the optimal processor assuming additive Gaussian
noise.

Suppose we are given the linear discrete Gauss-Markov model of the previous
section (ignoring u with $W = I$ for notational simplicity) and we would like to develop
the Bayesian processor. Since we know the processes are linear and Gaussian, then we
know that the required distributions will also be Gaussian. To develop the processor
for this case, we start with the prediction equation² and use the process model of Eq.
4.78, that is,

$$
\Pr(x(t)|Y_{t-1}) = \int A(x(t)|x(t-1)) \times \Pr(x(t-1)|Y_{t-1})\, dx(t-1)
$$

where the filtered conditional³ is:

$$
\Pr(x(t-1)|Y_{t-1}) \sim \mathcal{N}\big(x(t-1) : \hat{x}(t-1|t-1), \tilde{P}(t-1|t-1)\big)
$$


² We have changed Gaussian distribution notation to include the random process, that is,

³ This notation is defined in terms of conditional means and covariances by: $\hat{x}(t|t) := E\{x(t)|Y_t\}$ and
$\tilde{P}(t|t) := \operatorname{cov}(\tilde{x}(t|t))$ for the state estimation error, $\tilde{x}(t|t) := x(t) - \hat{x}(t|t)$.
Now using the process model, we have that

$$
A(x(t)|x(t-1)) \sim \mathcal{N}\big(x(t) : A(t-1)\hat{x}(t-1|t-1),\; A(t-1)\tilde{P}(t-1|t-1)A'(t-1) + R_{ww}(t-1)\big)
$$

which follows directly from the linearity of the conditional expectation operator,
that is,

$$
\begin{aligned}
\hat{x}(t|t-1) &= E\{x(t)|Y_{t-1}\} = E\{A(t-1)x(t-1) + w(t-1)|Y_{t-1}\} \\
&= A(t-1)\hat{x}(t-1|t-1)
\end{aligned} \tag{5.5}
$$

Using this result, the predicted state estimation error can be obtained as

$$
\begin{aligned}
\tilde{x}(t|t-1) &= x(t) - \hat{x}(t|t-1) \\
&= [A(t-1)x(t-1) + w(t-1)] - A(t-1)\hat{x}(t-1|t-1) \\
&= A(t-1)\tilde{x}(t-1|t-1) + w(t-1)
\end{aligned} \tag{5.6}
$$

and the corresponding state error covariance, $\tilde{P}(t|t-1) = E\{\tilde{x}(t|t-1)\tilde{x}'(t|t-1)\}$, is
easily derived. Summarizing, the conditional means and covariances that completely
characterize the current (Gaussian) state evolve according to the following equations:

$$
\begin{aligned}
\hat{x}(t|t-1) &= A(t-1)\hat{x}(t-1|t-1) && \text{[Prediction]} \\
\tilde{P}(t|t-1) &= A(t-1)\tilde{P}(t-1|t-1)A'(t-1) + R_{ww}(t-1) && \text{[Prediction Covariance]}
\end{aligned}
$$
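
The prediction and prediction covariance equations translate directly into a few matrix products. The sketch below is a minimal illustration using plain nested lists so that it has no dependencies; the helper names are hypothetical, vectors are stored as single-column matrices, and the numerical values in the usage note are assumed for the example.

```python
def mat_mul(a, b):
    """Product of two matrices stored as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Matrix transpose."""
    return [list(row) for row in zip(*a)]


def mat_add(a, b):
    """Element-wise matrix sum."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def predict_state(x_filtered, p_filtered, a, r_ww):
    """One prediction step: x(t|t-1) = A x(t-1|t-1), P(t|t-1) = A P(t-1|t-1) A' + R_ww."""
    x_predicted = mat_mul(a, x_filtered)
    p_predicted = mat_add(mat_mul(mat_mul(a, p_filtered), transpose(a)), r_ww)
    return x_predicted, p_predicted
```

For example, with an assumed transition matrix `[[1.0, 1.0], [0.0, 1.0]]`, a filtered estimate `[[1.0], [2.0]]`, an identity filtered covariance and `r_ww = [[0.1, 0.0], [0.0, 0.1]]`, the predicted state is `[[3.0], [2.0]]` and the predicted covariance is approximately `[[2.1, 1.0], [1.0, 1.1]]`.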

Substituting these multivariate Gaussian distributions (transition and filtered) into
the prediction equation, we have that

$$
\Pr(x(t)|Y_{t-1}) \sim \mathcal{N}\big(x(t) : \hat{x}(t|t-1), \tilde{P}(t|t-1)\big)
$$

With the prediction distribution available, we require the correction or update
distribution obtained from the likelihood and the measurement model of Eq. 5.4,
that is,

$$
\Pr(x(t)|Y_t) = \frac{C(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1})}{\Pr(y(t)|Y_{t-1})}
$$

Under the Gauss-Markov model assumptions, we know that each of the conditional
distributions can be expressed in terms of the Gaussian distribution as:

$$
\begin{aligned}
C(y(t)|x(t)) &\sim \mathcal{N}\big(y(t) : C(t)x(t), R_{vv}(t)\big) \\
\Pr(x(t)|Y_{t-1}) &\sim \mathcal{N}\big(x(t) : \hat{x}(t|t-1), \tilde{P}(t|t-1)\big) \\
\Pr(y(t)|Y_{t-1}) &\sim \mathcal{N}\big(y(t) : \hat{y}(t|t-1), R_{ee}(t)\big)
\end{aligned}
$$

for $R_{ee}(t)$ the innovations covariance with innovations defined by $e(t) := y(t) - \hat{y}(t|t-1)$
and predicted or filtered measurement given by $\hat{y}(t|t-1) = C(t)\hat{x}(t|t-1)$.

Substituting these probabilities into Eq. 5.4 and combining all constants into a 
single constant k, we obtain 


Pr(x(f)|F f ) = k x exp *(y(t) - C(t)x(t))'R^(t)(y(t) - C(t)x(f))J 

x exp - xlt\t - \))'P Ht\l ~ 1)W0 - x(t\t ~ 1))] 

x exp |+^e m - you - 1))'/?,; «)(y(o - m - i»J 

Recognizing the measurement noise, state estimation error and innovation in the above 
terms, we have that the posterior probability is given in terms of the Gauss-Markov 
model by 


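$$\begin{aligned}
\Pr(x(t)|Y_t) = \kappa &\times \exp\left[-\tfrac{1}{2}v'(t)R_{vv}^{-1}(t)v(t)\right] \\
&\times \exp\left[-\tfrac{1}{2}\tilde{x}'(t|t-1)P^{-1}(t|t-1)\tilde{x}(t|t-1)\right] \\
&\times \exp\left[+\tfrac{1}{2}e'(t)R_{ee}^{-1}(t)e(t)\right]
\end{aligned} \tag{5.7}$$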

So we see that the posterior distribution can be estimated under the multivariate 
Gaussian assumptions and the corresponding linear (time-varying) Gauss-Markov 
model. This is the optimal Bayesian processor under these assumptions. In most 
cases we are not able to characterize the distributions in closed form and must resort 
to numerical (simulation-based) solutions. 

We realize at this point that we have the optimal Bayesian predictor and posterior,
but we still have not explicitly extracted the optimal state estimate and its associated
performance metric. Recall from the batch Bayesian solutions of Sec. 2.1 that once
we have the posterior, we can estimate a variety of statistics using it as the basis. In
this case, the optimal Bayesian processor will be the one that maximizes the posterior;
therefore, we continue the development of the linear filter by deriving the Bayesian
MAP estimator.

Starting with the MAP equation of Eq. 2.3 and taking natural logarithms of each 
side gives 


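$$\ln\Pr(x(t)|Y_t) = \ln\Pr(y(t)|x(t)) + \ln\Pr(x(t)|Y_{t-1}) - \ln\Pr(y(t)|Y_{t-1})$$

In terms of the multivariate Gaussian distributions, the posterior is given by

$$\ln\Pr(x(t)|Y_t) = \ln\kappa - \tfrac{1}{2}v'(t)R_{vv}^{-1}(t)v(t) - \tfrac{1}{2}\tilde{x}'(t|t-1)P^{-1}(t|t-1)\tilde{x}(t|t-1) + \tfrac{1}{2}e'(t)R_{ee}^{-1}(t)e(t) \tag{5.8}$$

with

$$\tilde{x}(t|t-1) = x(t) - \hat{x}(t|t-1) \qquad \text{[State Error]}$$

$$e(t) = y(t) - C(t)\hat{x}(t|t-1) = C(t)\tilde{x}(t|t-1) + v(t) \qquad \text{[Innovation]}$$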

The MAP estimate is then obtained by differentiating Eq. 5.8, setting it to zero and 
solving, that is, 

$$\nabla_x \ln\Pr(x(t)|Y_t)\Big|_{x = \hat{x}_{map}} = 0 \tag{5.9}$$

Using the chain rule of the gradient operator (see Eq. 2.11), we obtain the following
expression. Note that the last term of Eq. 5.8 is not a function of $x(t)$, but just the
data; therefore, its gradient is null.

$$\nabla_x \ln\Pr(x(t)|Y_t) = C'(t)R_{vv}^{-1}(t)[y(t) - C(t)x(t)] - P^{-1}(t|t-1)\tilde{x}(t|t-1) \tag{5.10}$$

Setting Eq. 5.10 to zero and solving for x(t) gives the Bayesian MAP estimate 

$$\hat{x}_{map}(t) = \left[C'(t)R_{vv}^{-1}(t)C(t) + P^{-1}(t|t-1)\right]^{-1} \times \left[P^{-1}(t|t-1)\hat{x}(t|t-1) + C'(t)R_{vv}^{-1}(t)y(t)\right] \tag{5.11}$$

This relation can be simplified by using a form of the matrix inversion lemma [1] 
defined by the following equation 

$$(A + BD')^{-1} = A^{-1} - A^{-1}B(I + D'A^{-1}B)^{-1}D'A^{-1} \tag{5.12}$$

Defining the following terms for the lemma, $A = P^{-1}(t|t-1)$, $B = C'(t)R_{vv}^{-1}(t)$ and
setting $D' = C(t)$, we find that

$$\left[P^{-1}(t|t-1) + C'(t)R_{vv}^{-1}(t)C(t)\right]^{-1} = P(t|t-1) - P(t|t-1)C'(t)R_{vv}^{-1}(t)\left(I + C(t)P(t|t-1)C'(t)R_{vv}^{-1}(t)\right)^{-1} C(t)P(t|t-1) \tag{5.13}$$

Making the observation that the term in parentheses on the right-hand side of Eq. 5.13
can be rewritten by factoring out $R_{vv}^{-1}(t)$ as

$$\left(I + C(t)P(t|t-1)C'(t)R_{vv}^{-1}(t)\right)^{-1} = R_{vv}(t)\left(R_{vv}(t) + C(t)P(t|t-1)C'(t)\right)^{-1} \tag{5.14}$$

then Eq. 5.13 can also be expressed as

$$\left[P^{-1}(t|t-1) + C'(t)R_{vv}^{-1}(t)C(t)\right]^{-1} = P(t|t-1) - P(t|t-1)C'(t)\left(R_{vv}(t) + C(t)P(t|t-1)C'(t)\right)^{-1}C(t)P(t|t-1) \tag{5.15}$$

The innovations covariance can be expressed as 

$$R_{ee}(t) = \text{Cov}(e(t)) = C(t)P(t|t-1)C'(t) + R_{vv}(t)$$

and substituting, Eq. 5.15 becomes

$$\left[P^{-1}(t|t-1) + C'(t)R_{vv}^{-1}(t)C(t)\right]^{-1} = P(t|t-1) - P(t|t-1)C'(t)R_{ee}^{-1}(t)C(t)P(t|t-1) = (I - K(t)C(t))P(t|t-1) \tag{5.16}$$

where $K(t) = P(t|t-1)C'(t)R_{ee}^{-1}(t)$ is the gain. We see that Eq. 5.16 is simply the
updated error covariance, $P(t|t)$, equivalent to

$$P(t|t) = \left[P^{-1}(t|t-1) + C'(t)R_{vv}^{-1}(t)C(t)\right]^{-1} \tag{5.17}$$
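The chain of identities in Eqs. 5.15-5.17 can be checked numerically. The following is a minimal sketch restricted to the scalar case (so that matrix inverses become reciprocals); the function name and the test values are illustrative assumptions, not part of the text.

```python
def updated_covariance_forms(P_pred, C, R_vv):
    """Scalar check of Eqs. 5.15-5.17 (hypothetical helper, scalar case only)."""
    # Information form, Eq. 5.17: [P^-1(t|t-1) + C' Rvv^-1 C]^-1
    info_form = 1.0 / (1.0 / P_pred + C * C / R_vv)
    # Innovations covariance: Ree = C P(t|t-1) C' + Rvv
    R_ee = C * P_pred * C + R_vv
    # Lemma form, Eq. 5.15: P - P C' Ree^-1 C P
    lemma_form = P_pred - P_pred * C * (1.0 / R_ee) * C * P_pred
    # Gain form, Eq. 5.16: (I - K C) P with K = P C' Ree^-1
    K = P_pred * C / R_ee
    gain_form = (1.0 - K * C) * P_pred
    return info_form, lemma_form, gain_form

# Arbitrary illustrative values (assumptions)
info_form, lemma_form, gain_form = updated_covariance_forms(P_pred=0.5, C=2.0, R_vv=4.0)
print(info_form, lemma_form, gain_form)
```

All three expressions should agree to rounding error; for the values shown each equals 1/3.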

Thus we can eliminate the first bracketed term in Eq. 5.11 to give 

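$$\hat{x}_{map}(t) = P(t|t) \times \left[P^{-1}(t|t-1)\hat{x}(t|t-1) + C'(t)R_{vv}^{-1}(t)y(t)\right]$$

Solving Eq. 5.17 for $P^{-1}(t|t-1)$, we can substitute the result into the above equation
to give

$$\hat{x}_{map}(t) = P(t|t) \times \left[\left(P^{-1}(t|t) - C'(t)R_{vv}^{-1}(t)C(t)\right)\hat{x}(t|t-1) + C'(t)R_{vv}^{-1}(t)y(t)\right] \tag{5.18}$$

Multiplying out, regrouping terms and factoring, this relation can be rewritten as

$$\hat{x}_{map}(t) = \hat{x}(t|t-1) + \left(P(t|t)C'(t)R_{vv}^{-1}(t)\right)\left[y(t) - C(t)\hat{x}(t|t-1)\right] \tag{5.19}$$

or finally

$$\hat{x}_{map}(t) = \hat{x}(t|t) = \hat{x}(t|t-1) + K(t)e(t) \tag{5.20}$$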
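As a sanity check on the algebra from Eq. 5.11 to Eq. 5.20, the sketch below evaluates both forms of the MAP estimate for a scalar system. The function name and numerical values are assumptions chosen only for illustration.

```python
def map_estimate_forms(x_pred, P_pred, y, C, R_vv):
    """Scalar comparison of Eq. 5.11 with Eq. 5.20 (hypothetical helper)."""
    # Eq. 5.11: [C' Rvv^-1 C + P^-1]^-1 [P^-1 x_pred + C' Rvv^-1 y]
    x_info = (x_pred / P_pred + C * y / R_vv) / (C * C / R_vv + 1.0 / P_pred)
    # Eq. 5.20: x_pred + K e, with K = P C' Ree^-1 and e = y - C x_pred
    R_ee = C * P_pred * C + R_vv
    K = P_pred * C / R_ee
    x_gain = x_pred + K * (y - C * x_pred)
    return x_info, x_gain

# Arbitrary illustrative values (assumptions)
x_info, x_gain = map_estimate_forms(x_pred=1.0, P_pred=0.5, y=3.0, C=2.0, R_vv=4.0)
print(x_info, x_gain)
```

Both expressions return the same estimate, which is the point of the derivation: the information form and the predictor-corrector form are the same estimator written two ways.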

Now we only need to show the equivalence of the gain expression using the updated 
instead of predicted error covariance, that is, 

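$$\begin{aligned}
K(t) &= P(t|t-1)C'(t)R_{ee}^{-1}(t) = P(t|t)P^{-1}(t|t)\left(P(t|t-1)C'(t)R_{ee}^{-1}(t)\right) \\
&= P(t|t)\left[C'(t)R_{vv}^{-1}(t)C(t) + P^{-1}(t|t-1)\right]P(t|t-1)C'(t)R_{ee}^{-1}(t) \\
&= P(t|t)C'(t)R_{vv}^{-1}(t)\left[C(t)P(t|t-1)C'(t) + R_{vv}(t)\right]R_{ee}^{-1}(t)
\end{aligned} \tag{5.21}$$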
TABLE 5.1 Linear BP (Kalman Filter) Algorithm

Prediction

$\hat{x}(t|t-1) = A(t-1)\hat{x}(t-1|t-1) + B(t-1)u(t-1)$ (state prediction)
$P(t|t-1) = A(t-1)P(t-1|t-1)A'(t-1) + R_{ww}(t-1)$ (covariance prediction)

Innovation

$e(t) = y(t) - \hat{y}(t|t-1) = y(t) - C(t)\hat{x}(t|t-1)$ (innovation)
$R_{ee}(t) = C(t)P(t|t-1)C'(t) + R_{vv}(t)$ (innovation covariance)

Gain

$K(t) = P(t|t-1)C'(t)R_{ee}^{-1}(t)$ (gain or weight)

Update

$\hat{x}(t|t) = \hat{x}(t|t-1) + K(t)e(t)$ (state update)
$P(t|t) = [I - K(t)C(t)]P(t|t-1)$ (covariance update)

Initial Conditions

$\hat{x}(0|0)$, $P(0|0)$


which gives the desired result from the definition of innovations covariance. We now 
have two equivalent expressions in terms of the updated or predicted error covariances 
that can be used to calculate the gain 

$$K(t) = P(t|t)C'(t)R_{vv}^{-1}(t) = P(t|t-1)C'(t)R_{ee}^{-1}(t) \tag{5.22}$$

which completes the Bayes’ approach to the signal enhancement or equivalently state 
estimation problem yielding the optimum linear Bayesian processor (Kalman filter). 
A summary of the linear BP algorithm is shown in Table 5.1. 
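One recursion of the algorithm in Table 5.1 can be sketched in a few lines. This is a minimal scalar illustration, not the SSPACK_PC implementation; the function name is hypothetical, and the numerical values are borrowed from the RC circuit of Example 5.1 together with an arbitrary measurement and initial covariance.

```python
def linear_bp_step(x_upd, P_upd, y, u, A, B, C, R_ww, R_vv):
    """One scalar predict/innovate/update cycle following Table 5.1."""
    # Prediction
    x_pred = A * x_upd + B * u
    P_pred = A * P_upd * A + R_ww
    # Innovation and its covariance
    e = y - C * x_pred
    R_ee = C * P_pred * C + R_vv
    # Gain
    K = P_pred * C / R_ee
    # Update
    x_new = x_pred + K * e
    P_new = (1.0 - K * C) * P_pred
    return x_new, P_new, K, e, R_ee

# Illustrative call: y = 5.0 and P(0|0) = 0.01 are assumptions
C_meas, R_vv = 2.0, 4.0
x_new, P_new, K, e, R_ee = linear_bp_step(
    x_upd=2.5, P_upd=0.01, y=5.0, u=300e-6,
    A=0.97, B=100.0, C=C_meas, R_ww=1e-4, R_vv=R_vv)

# Eq. 5.22: the gain computed from the updated covariance
K_alt = P_new * C_meas / R_vv
print(K, K_alt)
```

The two printed gains agree, which is the equivalence stated in Eq. 5.22. A matrix version would replace the scalar divisions with linear solves.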

The design of linear Bayesian processors under the Gauss-Markov assumptions
is well understood [1-9]. Based on a variety of properties both theoretically well-founded
and pragmatically applied with high success, the minimum (error) variance
design procedure has evolved [10-14]. We summarize the design steps below
using the notation of the Bayesian processor algorithm in Table 5.1.

It is important to realize that a necessary and sufficient condition that the linear BP 
(under the GM constraints) is optimal is that the innovation sequence is zero-mean 
and white or uncorrelated! This is the first and most important step in BP design. 
If this condition does not hold then the underlying model and GM assumptions are 
invalid. Therefore, we briefly mention the minimum variance design procedure here 
and provide more details in Sec. 5.7 where pragmatic statistical tests are developed. 
We will apply the procedure to the following processors (linear and nonlinear) in the 
example problems and then provide the design details subsequently. 

The minimum (error) variance design procedure is: 

1. Check that the innovations sequence is zero-mean. 

2. Check that the innovations sequence is white. 

3. Check that the innovations sequence is uncorrelated in time with input $u$.

4. Check that the innovations sequence lies within the confidence limits
constructed from $R_{ee}$ predicted by the Bayesian processor.

5. Check that the innovations sequence variance is reasonably close to the
estimated (sample) variance, $\hat{R}_{ee}$.

6. Check that the state estimation error, $\tilde{x}(t|t)$, lies within the confidence limits
constructed from $P(t|t)$ predicted by the Bayesian processor.

7. Check that the state error variance is reasonably close to the estimated (sample)
variance, $\hat{P}(t|t)$.
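Steps 1 and 2 of the procedure can be sketched for a scalar innovation sequence as follows. The bounds used here (1.96 standard errors for the mean, and $1.96/\sqrt{N}$ for the normalized sample covariances with at most 5% of lags allowed outside) are common large-sample 95% limits and are an assumption of this sketch; the tests actually developed in Sec. 5.7 may differ in detail. Function names and the default number of lags are hypothetical.

```python
import math

def zero_mean_test(e):
    """Return (sample mean, 95% bound, pass flag) for a scalar sequence."""
    n = len(e)
    mean = sum(e) / n
    var = sum((v - mean) ** 2 for v in e) / n
    bound = 1.96 * math.sqrt(var / n)
    return mean, bound, abs(mean) < bound

def whiteness_test(e, max_lag=25):
    """Return (fraction of lags outside the bound, pass flag)."""
    n = len(e)
    mean = sum(e) / n
    c0 = sum((v - mean) ** 2 for v in e) / n  # assumes a non-constant sequence
    bound = 1.96 / math.sqrt(n)
    exceed = 0
    for k in range(1, max_lag + 1):
        ck = sum((e[t] - mean) * (e[t + k] - mean) for t in range(n - k)) / n
        if abs(ck / c0) > bound:
            exceed += 1
    fraction_out = exceed / max_lag
    return fraction_out, fraction_out <= 0.05

# A deliberately non-white test sequence: +1, -1, +1, ...
alternating = [(-1.0) ** t for t in range(100)]
print(zero_mean_test(alternating))   # zero mean, so this test passes
print(whiteness_test(alternating))   # strongly correlated, so this test fails
```

The alternating sequence shows why both checks are needed: it is exactly zero-mean yet far from white, so a processor producing such innovations would not be considered tuned.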

This is the basic “cookbook” approach to linear Bayesian processor design.
Before we close this section, let us consider a simple linear example to demonstrate 
the ideas. 


Example 5.1 

Suppose we have the RC circuit shown in Fig. 5.2. We measure the voltage across
the capacitor with a high impedance voltmeter as shown. Since these measurements
are noisy and the component values imprecise ($\pm\Delta$), we require an improved estimate
of the output voltage. We develop a BP to solve this problem from first principles, a
typical approach. Writing the Kirchhoff current equations at the node, we have

$$I_{in}(t) - \frac{e(t)}{R} - C\frac{de(t)}{dt} = 0$$

where $e_o$ is the initial voltage and $R$ is the resistance with $C$ the capacitance. The
measurement equation for a voltmeter of gain $K_e$ is simply

$$e_{out}(t) = K_e e(t)$$
We choose to use the discrete BP formulation; therefore, approximating the derivatives
with first differences and substituting, we have

$$C\frac{e(t) - e(t-1)}{\Delta T} = -\frac{e(t-1)}{R} + I_{in}(t-1)$$

or

$$e(t) = \left(1 - \frac{\Delta T}{RC}\right)e(t-1) + \frac{\Delta T}{C}I_{in}(t-1)$$

where the measurement is given above. Suppose that for this circuit the parameters are:
$R = 3.3$ kΩ and $C = 1000$ μF, $\Delta T = 100$ ms, $e_o = 2.5$ V, $K_e = 2.0$, and the voltmeter
is precise to within ±4 V. Then transforming the physical circuit model into state-space
form by defining $x = e$, $y = e_{out}$, and $u = I_{in}$, we obtain

$$x(t) = 0.97x(t-1) + 100u(t-1) + w(t-1)$$
$$y(t) = 2x(t) + v(t)$$
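The state-space coefficients follow directly from the component values. A short sketch of the arithmetic (variable names are illustrative assumptions):

```python
# Component values from the example
R = 3.3e3        # resistance, ohms
C = 1000e-6      # capacitance, farads
dT = 0.1         # sampling interval, seconds
K_e = 2.0        # voltmeter gain

a = 1.0 - dT / (R * C)   # state transition coefficient, 1 - dT/(RC)
b = dT / C               # input coefficient, dT/C
c = K_e                  # measurement coefficient

print(round(a, 2), round(b, 6), c)
```

Here $1 - \Delta T/RC \approx 0.9697$, which the example rounds to 0.97, and $\Delta T/C = 100$.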


The process noise covariance is used to model the circuit parameter uncertainty
with $R_{ww} = 0.0001$, since we assume standard deviations, $\Delta R$, $\Delta C$, of 1%. Also,
$R_{vv} = 4$, since two standard deviations are $\Delta V = 2\left(\tfrac{1}{2}\,4\ \text{V}\right)$. We also assume initially
that the state is $x(0) \sim \mathcal{N}(2.5, 10^{-12})$, and that the input current is a step function of
$u(t) = 300$ μA. SSPACK_PC is used to simulate this system [8]. The results are shown
in Fig. 5.3. The simulated and true (mean) states (voltages) are shown in Fig. 5.3a
along with the corresponding confidence limits. We see that the process samples (state
and process noise) lie within the bounds (3.5% out). Therefore, the data statistically
satisfy the underlying Gauss-Markov model assumptions. If they do not, we choose
another simulation; that is, we perform another realization (different seed in the random
number generator) until the samples lie within the bounds. Similarly, the simulated
and true (mean) measured voltages are shown in Fig. 5.3b. Again the data (measurement
and noise) statistically satisfy the underlying models with only 4.5% of the
samples exceeding the prescribed bounds. The state and measurement variances used
to construct the confidence intervals about the means, that is, $[m_x(t) \pm 1.96\sqrt{P(t)}]$
and $[m_y(t) \pm 1.96\sqrt{R_{yy}(t)}]$, are shown in Fig. 5.3c.

With the data simulated, we now consider the design of the BP. In the ideal BP
problem, we are given the model set, $\Sigma := \{A, B, C, R_{ww}, R_{vv}, x(0), P(0)\}$, the known
input $\{u(t)\}$ and the set of noisy measurements, $\{y(t)\}$, to construct the processor. The
RC model-based processor for this problem can simply be written as:

$$\hat{x}(t|t-1) = 0.97\hat{x}(t-1|t-1) + 100u(t-1) \qquad \text{[Predicted State]}$$

$$P(t|t-1) = 0.94P(t-1|t-1) + 0.0001 \qquad \text{[Predicted Covariance]}$$

$$e(t) = y(t) - 2\hat{x}(t|t-1) \qquad \text{[Innovation]}$$

$$R_{ee}(t) = 4P(t|t-1) + 4 \qquad \text{[Innovations Covariance]}$$

$$K(t) = 2\,\frac{P(t|t-1)}{4P(t|t-1) + 4} \qquad \text{[Gain]}$$

$$\hat{x}(t|t) = \hat{x}(t|t-1) + K(t)e(t) \qquad \text{[Updated State]}$$

$$P(t|t) = \frac{P(t|t-1)}{P(t|t-1) + 1} \qquad \text{[Updated Covariance]}$$

FIGURE 5.3 RC circuit problem Gauss-Markov simulation. (a) Simulated and true (mean)
output voltage. (b) Simulated and true (mean) measurement. (c) Simulated state and
measurement variances.
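The RC processor equations can be exercised on synthetic data with a few lines of code. This is a minimal sketch using Python's standard random number generator rather than SSPACK_PC, so the realization will not reproduce the figures. The function names, the seed, the number of samples, and the choice of $P(0|0) = 10^{-12}$ for the processor (taken from the stated prior variance) are assumptions.

```python
import random

def simulate_rc(n_steps, seed=0):
    """Simulate the Gauss-Markov RC model: returns (states, measurements)."""
    rng = random.Random(seed)
    x = 2.5                                  # x(0) mean; prior variance 1e-12 neglected
    xs, ys = [], []
    for _ in range(n_steps):
        # Process: Rww = 0.0001 -> std 0.01; input is a 300 microamp step
        x = 0.97 * x + 100.0 * 300e-6 + rng.gauss(0.0, 0.01)
        # Measurement: Rvv = 4 -> std 2
        y = 2.0 * x + rng.gauss(0.0, 2.0)
        xs.append(x)
        ys.append(y)
    return xs, ys

def rc_bp(ys, x0=2.5, P0=1e-12, u=300e-6):
    """Run the RC Bayesian processor; returns a list of (x_hat, P, K)."""
    x_hat, P = x0, P0
    out = []
    for y in ys:
        x_pred = 0.97 * x_hat + 100.0 * u    # predicted state
        P_pred = 0.94 * P + 0.0001           # predicted covariance
        e = y - 2.0 * x_pred                 # innovation
        R_ee = 4.0 * P_pred + 4.0            # innovations covariance
        K = 2.0 * P_pred / R_ee              # gain
        x_hat = x_pred + K * e               # updated state
        P = P_pred / (P_pred + 1.0)          # updated covariance
        out.append((x_hat, P, K))
    return out

xs, ys = simulate_rc(300, seed=0)
results = rc_bp(ys)
print(results[-1])
```

With these model values the covariance recursion settles to a steady-state updated covariance of roughly 0.0016, close to the predicted variance of 0.0017 quoted in the example.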


The estimator is also designed using SSPACK_PC and the results are shown in
Fig. 5.4. In Fig. 5.4a we see the estimated state (voltage) and estimation error as
well as the corresponding confidence bounds. Note that the processor “optimally”
estimates the voltage, since our models are exact. That is, it provides the minimum
error variance estimate in the Gaussian case. Also, since we have the true (mean) state,
we can calculate the estimation error and use the corresponding error covariance to
specify the bounds as shown. Note that the error is small and no samples exceed
the bound, as indicated by the overestimation of the variance compared with the
sample variance (0.0017 > 0.0002). In Fig. 5.4b, we see the filtered measurement
($\hat{y}(t|t-1)$) and corresponding innovation sequence along with the confidence limits
provided by the processor. Here only 4.5% of the samples exceed the bounds and the
variance predicted by the filter is close to the sample variance estimate (4.0 ≈ 3.7).
The weighted sum-squared residual (WSSR) statistic, zero-mean, and whiteness tests
are shown in Fig. 5.4c. Here we see that using a window of 75 samples, the threshold
is not exceeded, indicating a statistically white sequence. The innovation mean is
small and well within the bound (0.11 < 0.27). The sequence is statistically white,
since 0% of the normalized sample covariances exceed the bound. Finally, we see the
gain and updated error covariance as monotonically decreasing functions that reach
a steady-state (constant) value at approximately 8 sec. This completes the example
of an ideally “tuned” BP. △△△

FIGURE 5.4 BP design for RC circuit problem. (a) Estimated state (voltage) and
error. (b) Filtered voltage measurement and error (innovations). (c) WSSR and
zero-mean/whiteness tests. (d) Gain and updated error covariance.

5.4 LINEARIZED BAYESIAN PROCESSOR (LINEARIZED 
KALMAN FILTER) 

In this section we develop an approximate solution to the nonlinear processing problem
involving the linearization of the nonlinear process about a “known” reference
trajectory followed by the development of a Bayesian processor based on the underlying
linearized state-space model. Many processes in practice are nonlinear rather
than linear. Coupling the nonlinearities with noisy data makes the signal processing
problem a challenging one. In this section we limit our discussion to discrete nonlinear
systems. Continuous solutions to this problem are developed in [1-7].

Recall from the previous chapter that our process is characterized by a set of 
nonlinear stochastic vector difference equations in state-space form as 

$$x(t) = a[x(t-1)] + b[u(t-1)] + w(t-1) \tag{5.23}$$

with the corresponding measurement model

$$y(t) = c[x(t)] + v(t) \tag{5.24}$$

where $a[\cdot]$, $b[\cdot]$, $c[\cdot]$ are nonlinear vector functions of $x$, $u$, with $x, a, b, w \in R^{N_x \times 1}$,
$y, c, v \in R^{N_y \times 1}$ and $w \sim \mathcal{N}(0, R_{ww}(t))$, $v \sim \mathcal{N}(0, R_{vv}(t))$.

In Chapter 4 we linearized a deterministic nonlinear model using a first-order
Taylor series expansion for the functions, $a$, $b$, and $c$ and developed a linearized
Gauss-Markov perturbation model valid for small deviations given by

$$\delta x(t) = A[x^*(t-1)]\delta x(t-1) + B[u^*(t-1)]\delta u(t-1) + w(t-1)$$

$$\delta y(t) = C[x^*(t)]\delta x(t) + v(t) \tag{5.25}$$

with $A$, $B$ and $C$ the corresponding Jacobian matrices and $w$, $v$ zero-mean, Gaussian.
We used linearization techniques to approximate the statistics of Eqs. 5.23 and 5.24
and summarized these results in the “approximate” Gauss-Markov model of Table 4.3
of the previous chapter. We now use this perturbation model to
construct a Bayesian processor that embeds the ($A[\cdot]$, $B[\cdot]$, $C[\cdot]$) Jacobians linearized
about the reference trajectory $[x^*, u^*]$. Each of the Jacobians is deterministic and
time-varying, since they are updated at each time-step. Replacing the ($A$, $B$) matrices
and $\hat{x}(t|t-1)$ in Table 5.1, respectively, by the Jacobians and $\delta\hat{x}(t|t-1)$, we obtain
the state perturbation predicted estimate

$$\delta\hat{x}(t|t-1) = A[x^*(t-1)]\delta\hat{x}(t-1|t-1) + B[u^*(t-1)]\delta u(t-1) \tag{5.26}$$

For the Bayesian estimation problem, we are interested in the state estimate $\hat{x}(t|t-1)$,
not its deviation $\delta\hat{x}(t|t-1)$. From the definition of the perturbation in Sec. 4.8,
we have

$$\hat{x}(t|t-1) = \delta\hat{x}(t|t-1) + x^*(t) \tag{5.27}$$

where the reference trajectory $x^*(t)$ was defined previously as

$$x^*(t) = a[x^*(t-1)] + b[u^*(t-1)] \tag{5.28}$$

Substituting this relation along with Eq. 5.26 into Eq. 5.27 gives

$$\begin{aligned}
\hat{x}(t|t-1) = {} & a[x^*(t-1)] + A[x^*(t-1)][\hat{x}(t-1|t-1) - x^*(t-1)] \\
& + b[u^*(t-1)] + B[u^*(t-1)][u(t-1) - u^*(t-1)]
\end{aligned} \tag{5.29}$$

The corresponding perturbed innovation can also be found directly

$$\begin{aligned}
\delta e(t) &= \delta y(t) - \delta\hat{y}(t|t-1) = (y(t) - y^*(t)) - (\hat{y}(t|t-1) - y^*(t)) \\
&= y(t) - \hat{y}(t|t-1) = e(t)
\end{aligned} \tag{5.30}$$


Using the linear BP with deterministic Jacobian matrices results in

$$\delta\hat{y}(t|t-1) = C[x^*(t)]\delta\hat{x}(t|t-1) \tag{5.31}$$

and therefore using this relation and Eq. 4.142 for the reference measurement, we
have

$$\hat{y}(t|t-1) = y^*(t) + C[x^*(t)]\delta\hat{x}(t|t-1) = c[x^*(t)] + C[x^*(t)]\delta\hat{x}(t|t-1) \tag{5.32}$$

Therefore it follows that the innovation is

$$e(t) = y(t) - c[x^*(t)] - C[x^*(t)][\hat{x}(t|t-1) - x^*(t)] \tag{5.33}$$

The updated estimate is easily found by substituting Eq. 5.27 to obtain

$$\delta\hat{x}(t|t) = \delta\hat{x}(t|t-1) + K(t)e(t)$$

$$[\hat{x}(t|t) - x^*(t)] = [\hat{x}(t|t-1) - x^*(t)] + K(t)e(t) \tag{5.34}$$

which yields the identical update equation of Table 5.1. Since the state perturbation
estimation error is identical to the state estimation error, the corresponding error
covariance is given by $\delta P(t|\cdot) = P(t|\cdot)$ and therefore,

$$\delta\tilde{x}(t|\cdot) = \delta x(t) - \delta\hat{x}(t|\cdot) = [x(t) - x^*(t)] - [\hat{x}(t|\cdot) - x^*(t)] = x(t) - \hat{x}(t|\cdot) \tag{5.35}$$

The gain is just a function of the measurement linearization, C[x*(t)] completing the 
algorithm. We summarize the discrete linearized Bayesian processor (Kalman filter) 
in Table 5.2. 


TABLE 5.2 Linearized BP (Kalman Filter) Algorithm

Prediction

$\hat{x}(t|t-1) = a[x^*(t-1)] + A[x^*(t-1)][\hat{x}(t-1|t-1) - x^*(t-1)] + b[u^*(t-1)] + B[u^*(t-1)][u(t-1) - u^*(t-1)]$ (state prediction)
$P(t|t-1) = A[x^*(t-1)]P(t-1|t-1)A'[x^*(t-1)] + R_{ww}(t-1)$ (covariance prediction)

Innovation

$e(t) = y(t) - c[x^*(t)] - C[x^*(t)][\hat{x}(t|t-1) - x^*(t)]$ (innovation)
$R_{ee}(t) = C[x^*(t)]P(t|t-1)C'[x^*(t)] + R_{vv}(t)$ (innovation covariance)

Gain

$K(t) = P(t|t-1)C'[x^*(t)]R_{ee}^{-1}(t)$ (gain or weight)

Update

$\hat{x}(t|t) = \hat{x}(t|t-1) + K(t)e(t)$ (state update)
$P(t|t) = [I - K(t)C[x^*(t)]]P(t|t-1)$ (covariance update)

Initial Conditions

$\hat{x}(0|0)$, $P(0|0)$

Jacobians

$A[x^*(t-1)] \equiv \dfrac{\partial a[x(t-1)]}{\partial x(t-1)}\Big|_{x = x^*(t-1)}$, $\quad B[u^*(t-1)] \equiv \dfrac{\partial b[u(t-1)]}{\partial u(t-1)}\Big|_{u = u^*(t-1)}$, $\quad C[x^*(t)] \equiv \dfrac{\partial c[x(t)]}{\partial x(t)}\Big|_{x = x^*(t)}$

In a more formal framework, the LZ-BP can be developed under (approximate)
Gaussian assumptions using the Bayesian approach as before in the linear case. We
briefly outline the derivation by carefully following the steps in Sec. 5.3.

The a posteriori probability is given by

$$\Pr(x(t)|Y_t) = \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1})}{\Pr(y(t)|Y_{t-1})} \tag{5.36}$$

Under the Gauss-Markov model assumptions, we know that each of the conditional
distributions can be expressed in terms of the conditional Gaussian distributions as:

1. $\Pr(y(t)|x(t)) : \mathcal{N}(c[x(t)], R_{vv}(t))$

2. $\Pr(x(t)|Y_{t-1}) : \mathcal{N}(\hat{x}(t|t-1), P(t|t-1))$

3. $\Pr(y(t)|Y_{t-1}) : \mathcal{N}(\hat{y}(t|t-1), R_{ee}(t))$

Using the nonlinear models developed earlier, substituting the Gaussian probabilities
and taking logarithms, we obtain the logarithmic a posteriori probability as

$$\ln\Pr(x(t)|Y_t) = \ln\kappa - \tfrac{1}{2}v'(t)R_{vv}^{-1}(t)v(t) - \tfrac{1}{2}\tilde{x}'(t|t-1)P^{-1}(t|t-1)\tilde{x}(t|t-1) + \tfrac{1}{2}e'(t)R_{ee}^{-1}(t)e(t) \tag{5.37}$$


The MAP estimate is then obtained by differentiating this equation, setting the
result to zero and solving for $x(t)$, that is,

$$\nabla_x \ln\Pr(x(t)|Y_t)\Big|_{x = \hat{x}_{map}} = 0 \tag{5.38}$$

Before we attempt to derive the MAP estimate, we first linearize about a reference
trajectory, $x^* \to x$, with the nonlinear measurement model approximated by a
first-order Taylor series

$$c[x(t)] \approx c[x^*(t)] + \frac{\partial c[x^*(t)]}{\partial x^*(t)}\delta x(t) = c[x^*(t)] + C[x^*(t)](x(t) - x^*(t))$$
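To see how good the first-order approximation is, the sketch below evaluates the linearization error for the scalar measurement function $c[x] = x^2 + x^3$, which is borrowed from Example 5.2 later in this section. The function names and the reference point $x^* = 2.0$ are assumptions for illustration.

```python
def c(x):
    """Nonlinear measurement function (from Example 5.2)."""
    return x ** 2 + x ** 3

def C_jac(x):
    """Measurement Jacobian dc/dx."""
    return 2.0 * x + 3.0 * x ** 2

def c_linearized(x, x_ref):
    """First-order Taylor approximation of c about the reference x_ref."""
    return c(x_ref) + C_jac(x_ref) * (x - x_ref)

x_ref = 2.0
for dx in (0.1, 0.01, 0.001):
    err = c(x_ref + dx) - c_linearized(x_ref + dx, x_ref)
    print(dx, err)   # error shrinks roughly as dx squared
```

For this function the neglected terms are exactly $(1 + 3x^*)\delta x^2 + \delta x^3$, so the error falls by about a factor of 100 for each tenfold reduction in the deviation. This is why the linearized processor relies on the state staying close to the reference trajectory.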

Substituting this result into Eq. 5.37, we obtain

$$\begin{aligned}
\ln\Pr(x(t)|Y_t) = \ln\kappa & - \tfrac{1}{2}\left(y(t) - c[x^*(t)] - C[x^*(t)](x(t) - x^*(t))\right)'R_{vv}^{-1}(t) \\
& \times \left(y(t) - c[x^*(t)] - C[x^*(t)](x(t) - x^*(t))\right) - \tfrac{1}{2}\tilde{x}'(t|t-1)P^{-1}(t|t-1)\tilde{x}(t|t-1) \\
& + \tfrac{1}{2}(y(t) - \hat{y}(t|t-1))'R_{ee}^{-1}(t)(y(t) - \hat{y}(t|t-1))
\end{aligned} \tag{5.39}$$

Applying the chain rule and the gradient operator (see Chapter 3), we obtain the
following expression. Note that the last term of Eq. 5.39 is not a function of $x(t)$, but
just the data, so its gradient is null.

$$\nabla_x \ln\Pr(x(t)|Y_t) = C'[x^*(t)]R_{vv}^{-1}(t)\left(y(t) - c[x^*(t)] - C[x^*(t)](x(t) - x^*(t))\right) - P^{-1}(t|t-1)[x(t) - \hat{x}(t|t-1)] = 0 \tag{5.40}$$

Multiplying through and grouping like terms in $x(t)$ gives

$$C'[x^*(t)]R_{vv}^{-1}(t)\left(y(t) - c[x^*(t)] + C[x^*(t)]x^*(t)\right) - \left[P^{-1}(t|t-1) + C'[x^*(t)]R_{vv}^{-1}(t)C[x^*(t)]\right]x(t) + P^{-1}(t|t-1)\hat{x}(t|t-1) = 0 \tag{5.41}$$

Solving for $x(t) = \hat{x}_{map}(t)$ gives

$$\begin{aligned}
\hat{x}_{map}(t) = {} & \left[P^{-1}(t|t-1) + C'[x^*(t)]R_{vv}^{-1}(t)C[x^*(t)]\right]^{-1} \\
& \times \left[P^{-1}(t|t-1)\hat{x}(t|t-1) + C'[x^*(t)]R_{vv}^{-1}(t)\left(y(t) - c[x^*(t)] + C[x^*(t)]x^*(t)\right)\right]
\end{aligned} \tag{5.42}$$

Using the matrix inversion manipulations of Eqs. 5.12-5.17 with $C(t) \to C[x^*(t)]$,
the first term in Eq. 5.42 becomes

$$\begin{aligned}
\left[P^{-1}(t|t-1) + C'[x^*(t)]R_{vv}^{-1}(t)C[x^*(t)]\right]^{-1} &= \left(I - P(t|t-1)C'[x^*(t)]R_{ee}^{-1}(t)C[x^*(t)]\right)P(t|t-1) \\
&= \left(I - K(t)C[x^*(t)]\right)P(t|t-1) = P(t|t)
\end{aligned} \tag{5.43}$$

where $K$ is the Kalman gain and $R_{ee}$ is the innovations covariance of the LZ-BP, with this
expression precisely the updated error covariance, $P(t|t)$, as in Table 5.2.

Solving this equation for the inverse of the predicted error covariance gives

$$P^{-1}(t|t-1) = P^{-1}(t|t) - C'[x^*(t)]R_{vv}^{-1}(t)C[x^*(t)] \tag{5.44}$$

and substituting into Eq. 5.42 using the results of Eq. 5.43 yields

$$\begin{aligned}
\hat{x}_{map}(t) = P(t|t)\Big[ & \left(P^{-1}(t|t) - C'[x^*(t)]R_{vv}^{-1}(t)C[x^*(t)]\right)\hat{x}(t|t-1) \\
& + C'[x^*(t)]R_{vv}^{-1}(t)\left(y(t) - c[x^*(t)] + C[x^*(t)]x^*(t)\right)\Big]
\end{aligned} \tag{5.45}$$

Multiplying through by the updated error covariance and recognizing the expression
for the Kalman gain gives

$$\hat{x}_{map}(t) = \hat{x}(t|t-1) - K(t)C[x^*(t)]\hat{x}(t|t-1) + K(t)C[x^*(t)]x^*(t) + K(t)\left(y(t) - c[x^*(t)]\right) \tag{5.46}$$

which leads to the final expression for the linearized MAP estimate as

$$\hat{x}_{map}(t) = \hat{x}(t|t) = \hat{x}(t|t-1) + K(t)\left(y(t) - c[x^*(t)] - C[x^*(t)](\hat{x}(t|t-1) - x^*(t))\right) \tag{5.47}$$

Compare this result to Table 5.1. The error covariances and predicted estimates follow
as in the linear case. This completes the derivation. Next let us consider a discrete
version of the nonlinear system example given in Jazwinski [1].


Example 5.2 

Consider the discrete nonlinear process given by

$$x(t) = (1 - 0.05\Delta T)x(t-1) + 0.04\Delta T x^2(t-1) + w(t-1)$$

with corresponding measurement model

$$y(t) = x^2(t) + x^3(t) + v(t)$$

where $v(t) \sim \mathcal{N}(0, 0.09)$, $x(0) = 2.3$, $P(0) = 0.01$, $\Delta T = 0.01$ sec and $R_{ww} = 0$. The
simulated measurement using SSPACK_PC [8] is shown in Fig. 5.5c. The LZ-BP is
designed from the following Jacobians:

$$A[x(t-1)] = 1 - 0.05\Delta T + 0.08\Delta T x(t-1) \quad \text{and} \quad C[x(t)] = 2x(t) + 3x^2(t)$$

Observing the mean state, we develop a reference trajectory by fitting a line to the
simulated state, which is given by

$$x^*(t) = 0.067t + 2.0, \quad 0 \le t \le 1.5 \quad \text{and} \quad u^*(t) = u(t) = 0.0 \quad \forall t$$

The LZ-BP algorithm is then given by

$$\begin{aligned}
\hat{x}(t|t-1) &= (1 - 0.05\Delta T)x^*(t-1) + 0.04\Delta T x^{*2}(t-1) \\
&\quad + (1 - 0.05\Delta T + 0.08\Delta T x^*(t-1))[\hat{x}(t-1|t-1) - x^*(t-1)] \\
P(t|t-1) &= [1 - 0.05\Delta T + 0.08\Delta T x^*(t-1)]^2 P(t-1|t-1) \\
e(t) &= y(t) - (x^{*2}(t) + x^{*3}(t)) - (2x^*(t) + 3x^{*2}(t))[\hat{x}(t|t-1) - x^*(t)] \\
R_{ee}(t) &= [2x^*(t) + 3x^{*2}(t)]^2 P(t|t-1) + 0.09 \\
K(t) &= \frac{P(t|t-1)[2x^*(t) + 3x^{*2}(t)]}{R_{ee}(t)} \\
\hat{x}(t|t) &= \hat{x}(t|t-1) + K(t)e(t) \\
P(t|t) &= (1 - K(t)[2x^*(t) + 3x^{*2}(t)])P(t|t-1) \\
\hat{x}(0|0) &= 2.3 \quad \text{and} \quad P(0|0) = 0.01
\end{aligned}$$

FIGURE 5.5 Linearized BP simulation. (a) Estimated state (0% out) and error (0% out).
(b) Filtered measurement (1% out) and error (innovation) (2.6% out). (c) Simulated
measurement and zero-mean/whiteness test ($3.9 \times 10^{-2} < 10.1 \times 10^{-2}$ and 0% out).
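The LZ-BP recursion for this example can be sketched as follows, following Table 5.2 with the Jacobians evaluated along the reference line. This is not the SSPACK_PC run, so the realization differs from Fig. 5.5. The true initial state of 2.0 used to generate data is an assumption inferred from the intercept of the fitted reference line; the function names, seed, and 150-sample run length are likewise illustrative.

```python
import random

DT = 0.01      # sampling interval (sec)
R_VV = 0.09    # measurement noise variance

def a(x):      # process function
    return (1.0 - 0.05 * DT) * x + 0.04 * DT * x * x

def A(x):      # process Jacobian
    return 1.0 - 0.05 * DT + 0.08 * DT * x

def c(x):      # measurement function
    return x ** 2 + x ** 3

def C(x):      # measurement Jacobian
    return 2.0 * x + 3.0 * x ** 2

def x_ref(k):  # reference trajectory x*(t) = 0.067 t + 2.0 at sample k
    return 0.067 * (k * DT) + 2.0

def run_lzbp(n_steps=150, x_true0=2.0, seed=1):
    """Simulate data and run the linearized processor; x_true0 is assumed."""
    rng = random.Random(seed)
    x_true, x_hat, P = x_true0, 2.3, 0.01
    history = []
    for k in range(1, n_steps + 1):
        x_true = a(x_true)                              # Rww = 0: no process noise
        y = c(x_true) + rng.gauss(0.0, R_VV ** 0.5)
        xr_prev, xr = x_ref(k - 1), x_ref(k)
        x_pred = a(xr_prev) + A(xr_prev) * (x_hat - xr_prev)
        P_pred = A(xr_prev) ** 2 * P
        e = y - c(xr) - C(xr) * (x_pred - xr)
        R_ee = C(xr) ** 2 * P_pred + R_VV
        K = P_pred * C(xr) / R_ee
        x_hat = x_pred + K * e
        P = (1.0 - K * C(xr)) * P_pred
        history.append((x_true, x_hat, P))
    return history

history = run_lzbp()
print(history[-1])
```

Because the gain depends only on the reference trajectory, the covariance and gain sequences here could be precomputed before any data arrive, which is one practical attraction of the linearized processor.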

An LZ-BP run is depicted in Fig. 5.5. Here we see that the state estimate begins
tracking the true state after the initial transient. The estimation error is good (0%
lie outside confidence limits), indicating the filter is performing properly for this
realization. The filtered measurement and innovations are shown in Fig. 5.5b with
their corresponding predicted confidence limits. Both estimates lie well within these
bounds. The innovations are zero mean ($3.9 \times 10^{-2} < 10.1 \times 10^{-2}$) and white (0%
lie outside limits) as shown in Fig. 5.5c, indicating proper tuning. This completes the
nonlinear filtering example. △△△
5.5 EXTENDED BAYESIAN PROCESSOR (EXTENDED KALMAN FILTER)

In this section we develop the extended Bayesian processor (XBP) or equivalently the
extended Kalman filter (EKF). The XBP is ad hoc in nature, but has become one of
the workhorses of (approximate) nonlinear filtering [1-11]. It has found applicability
in a wide variety of applications such as tracking [15], navigation [1, 5], chemical
processing [17], ocean acoustics [18], and seismology [19] (see [20] for a detailed list).
The XBP evolves directly from the linearized processor of the previous section in
which the reference state, $x^*(t)$, used in the linearization process is replaced with
the most recently available state estimate, $\hat{x}(t|t)$; this is the step that makes the
processor ad hoc. We must realize that the Jacobians used in the linearization process
are deterministic (but time-varying) when a reference or perturbation trajectory is
used. However, using the current state estimate is an approximation to the conditional
mean, which is random, making the associated Jacobians and subsequent relations
random. Therefore, although popularly ignored, most XBP designs should be based
on ensemble operations to obtain reasonable estimates of the underlying statistics.
With this in mind, we develop the processor directly from the LZ-BP. Thus, if instead
of using the reference trajectory, we choose to linearize about each new state estimate
as soon as it becomes available, then the XBP algorithm results. The reason for
choosing to linearize about this estimate is that it represents the best information we
have about the state and therefore most likely results in a better reference trajectory
(state estimate). As a consequence, large initial estimation errors do not propagate;
therefore, linearity assumptions are less likely to be violated. Thus, if we choose to
use the current estimate $\hat{x}(t|\alpha)$, where $\alpha$ is $t-1$ or $t$, to linearize about instead of the
reference trajectory $x^*(t)$, then the XBP evolves. That is, let

$$x^*(t) = \hat{x}(t|\alpha) \quad \text{for } t-1 \le \alpha \le t \tag{5.48}$$

Then, for instance, when $\alpha = t-1$, the predicted perturbation is

$$\delta\hat{x}(t|t-1) = \hat{x}(t|t-1) - x^*(t)\Big|_{x^* = \hat{x}(t|t-1)} = 0 \tag{5.49}$$

Thus, it follows immediately that when $x^*(t) = \hat{x}(t|t)$, then $\delta\hat{x}(t|t) = 0$ as well.

Substituting the current estimate, either prediction or update, into the LZ-BP algorithm,
it is easy to see that each of the difference terms $[\hat{x} - x^*]$ is null, resulting in the
XBP algorithm. That is, examining the prediction phase of the linearized algorithm,
substituting the current available updated estimate, $\hat{x}(t-1|t-1)$, for the reference
and using the fact that ($u^*(t) = u(t)\ \forall t$), we have

$$\begin{aligned}
\hat{x}(t|t-1) = {} & a[\hat{x}(t-1|t-1)] + A[\hat{x}(t-1|t-1)][\hat{x}(t-1|t-1) - \hat{x}(t-1|t-1)] \\
& + b[u(t-1)] + B[u(t-1)][u(t-1) - u(t-1)]
\end{aligned}$$

giving the prediction of the XBP

$$\hat{x}(t|t-1) = a[\hat{x}(t-1|t-1)] + b[u(t-1)] \tag{5.50}$$
Now with the predicted estimate available, substituting it for the reference in Eq. 5.33
gives the innovation sequence as

$$\begin{aligned}
e(t) &= y(t) - c[\hat{x}(t|t-1)] - C[\hat{x}(t|t-1)] \times [\hat{x}(t|t-1) - \hat{x}(t|t-1)] \\
&= y(t) - c[\hat{x}(t|t-1)]
\end{aligned} \tag{5.51}$$

where we have the new predicted or filtered measurement expression

$$\hat{y}(t|t-1) = c[\hat{x}(t|t-1)] \tag{5.52}$$

The updated state estimate is easily obtained by substituting the predicted estimate
for the reference ($\hat{x}(t|t-1) \to x^*(t)$) in Eq. 5.34

$$\begin{aligned}
\delta\hat{x}(t|t) &= \delta\hat{x}(t|t-1) + K(t)e(t) \\
[\hat{x}(t|t) - \hat{x}(t|t-1)] &= [\hat{x}(t|t-1) - \hat{x}(t|t-1)] + K(t)e(t) \\
\hat{x}(t|t) &= \hat{x}(t|t-1) + K(t)e(t)
\end{aligned} \tag{5.53}$$
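The collapse of the linearized equations into the extended ones can be confirmed numerically: when the reference is set equal to the most recent estimate, every difference term vanishes. The sketch below uses the scalar model of Example 5.2 (with $b = 0$); the function names and the sample values of the estimate and measurement are assumptions.

```python
DT = 0.01

def a(x):      # process function of Example 5.2
    return (1.0 - 0.05 * DT) * x + 0.04 * DT * x * x

def A(x):      # process Jacobian
    return 1.0 - 0.05 * DT + 0.08 * DT * x

def c(x):      # measurement function
    return x ** 2 + x ** 3

def C(x):      # measurement Jacobian
    return 2.0 * x + 3.0 * x ** 2

def lz_predict(x_upd, x_ref_prev):
    """Linearized prediction (Eq. 5.29 with no input)."""
    return a(x_ref_prev) + A(x_ref_prev) * (x_upd - x_ref_prev)

def lz_innovation(y, x_pred, x_ref_now):
    """Linearized innovation (Eq. 5.33)."""
    return y - c(x_ref_now) - C(x_ref_now) * (x_pred - x_ref_now)

x_upd, y = 2.3, 12.5                   # illustrative values (assumptions)
x_pred = lz_predict(x_upd, x_upd)      # reference := x_hat(t-1|t-1)
e = lz_innovation(y, x_pred, x_pred)   # reference := x_hat(t|t-1)

print(x_pred, a(x_upd))                # identical: Eq. 5.50
print(e, y - c(x_pred))                # identical: Eq. 5.51
```

The Jacobians drop out of the state prediction and innovation entirely; they remain only in the covariance and gain calculations.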

The covariance and gain equations are identical to those in Table 5.2, but with the
Jacobian matrices $A$, $B$, and $C$ linearized about the predicted state estimate, $\hat{x}(t|t-1)$.
Thus, we obtain the discrete XBP or equivalently the EKF algorithm summarized in
Table 5.3. Note that the covariance matrices, $P$, and the gain, $K$, are now functions of
the current state estimate, which is the approximate conditional mean estimate and
therefore a single realization of a stochastic process. Thus, ensemble (Monte Carlo)
techniques should be used to evaluate estimator performance; that is, for new initial
conditions selected by a Gaussian random number generator (either $\hat{x}(0|0)$ or $P(0|0)$),
the algorithm is executed, generating a set of estimates that should be averaged over
the entire ensemble to get an “expected” state, etc. Note also that in
practice this algorithm is usually implemented using sequential processing and
UD (upper diagonal/square root) factorization techniques (see [21]).

TABLE 5.3 Extended BP (Kalman Filter) Algorithm

Prediction

$\hat{x}(t|t-1) = a[\hat{x}(t-1|t-1)] + b[u(t-1)]$ (state prediction)
$P(t|t-1) = A[\hat{x}(t|t-1)]P(t-1|t-1)A'[\hat{x}(t|t-1)] + R_{ww}(t-1)$ (covariance prediction)

Innovation

$e(t) = y(t) - c[\hat{x}(t|t-1)]$ (innovation)
$R_{ee}(t) = C[\hat{x}(t|t-1)]P(t|t-1)C'[\hat{x}(t|t-1)] + R_{vv}(t)$ (innovation covariance)

Gain

$K(t) = P(t|t-1)C'[\hat{x}(t|t-1)]R_{ee}^{-1}(t)$ (gain or weight)

Update

$\hat{x}(t|t) = \hat{x}(t|t-1) + K(t)e(t)$ (state update)
$P(t|t) = [I - K(t)C[\hat{x}(t|t-1)]]P(t|t-1)$ (covariance update)

Initial Conditions

$\hat{x}(0|0)$, $P(0|0)$

Jacobians

$A[\hat{x}(t|t-1)] \equiv \dfrac{\partial a[x(t-1)]}{\partial x(t-1)}\Big|_{x = \hat{x}(t|t-1)}$, $\quad C[\hat{x}(t|t-1)] \equiv \dfrac{\partial c[x(t)]}{\partial x(t)}\Big|_{x = \hat{x}(t|t-1)}$

The XBP can also be developed under (approximate) Gaussian assumptions using 
the Bayesian approach as before in the linear case. We briefly outline the derivation. 

The a posteriori probability is given by

$$\Pr(x(t)|Y_t) = \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1})}{\Pr(y(t)|Y_{t-1})} \tag{5.54}$$

Under the Gauss-Markov model assumptions, we know that each of the conditional
distributions can be expressed in terms of the conditional Gaussian distributions as:

1. $\Pr(y(t)|x(t)) : \mathcal{N}(c[x(t)], R_{vv}(t))$

2. $\Pr(x(t)|Y_{t-1}) : \mathcal{N}(\hat{x}(t|t-1), P(t|t-1))$

3. $\Pr(y(t)|Y_{t-1}) : \mathcal{N}(\hat{y}(t|t-1), R_{ee}(t))$

Using the nonlinear models developed earlier, substituting the Gaussian probabilities
and taking logarithms, we obtain the logarithmic a posteriori probability as

$$\ln\Pr(x(t)|Y_t) = \ln\kappa - \tfrac{1}{2}v'(t)R_{vv}^{-1}(t)v(t) - \tfrac{1}{2}\tilde{x}'(t|t-1)P^{-1}(t|t-1)\tilde{x}(t|t-1) + \tfrac{1}{2}e'(t)R_{ee}^{-1}(t)e(t) \tag{5.55}$$

The MAP estimate is then obtained by differentiating Eq. 5.55, setting it to zero
and solving; that is,

$$\nabla_x \ln\Pr(x(t)|Y_t)\Big|_{x = \hat{x}_{map}} = 0 \tag{5.56}$$

Applying the chain rule and the gradient operation (see Chapter 3), we obtain the
following expression. Note that the last term of Eq. 5.55 is not a function of $x(t)$, but
just the data.

$$\nabla_x \ln\Pr(x(t)|Y_t) = \left[\nabla_x c'[x(t)]\right]R_{vv}^{-1}(t)(y(t) - c[x(t)]) - P^{-1}(t|t-1)(x(t) - \hat{x}(t|t-1)) \tag{5.57}$$

Using a first-order Taylor series approximation for $c[x(t)]$ linearized about the predicted
estimate $\hat{x}(t|t-1)$ and the usual definition for the Jacobian matrix $C[\hat{x}(t|t-1)]$, we have

$$\begin{aligned}
\nabla_x \ln\Pr(x(t)|Y_t) = {} & C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)\left[y(t) - c[\hat{x}(t|t-1)] - C[\hat{x}(t|t-1)](x(t) - \hat{x}(t|t-1))\right] \\
& - P^{-1}(t|t-1)[x(t) - \hat{x}(t|t-1)]
\end{aligned} \tag{5.58}$$

Identifying the innovation vector and grouping like terms gives the expression

$$\begin{aligned}
\nabla_x \ln\Pr(x(t)|Y_t) = {} & C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)e(t) \\
& - \left[C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)C[\hat{x}(t|t-1)] + P^{-1}(t|t-1)\right]x(t) \\
& + \left[C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)C[\hat{x}(t|t-1)] + P^{-1}(t|t-1)\right]\hat{x}(t|t-1)
\end{aligned}$$

Setting this equation to zero and solving for $x(t) = \hat{x}_{map}(t)$ gives

$$\begin{aligned}
\hat{x}_{map}(t) = {} & \left[C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)C[\hat{x}(t|t-1)] + P^{-1}(t|t-1)\right]^{-1} \\
& \times \left[C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)C[\hat{x}(t|t-1)] + P^{-1}(t|t-1)\right]\hat{x}(t|t-1) \\
& + \left[C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)C[\hat{x}(t|t-1)] + P^{-1}(t|t-1)\right]^{-1}C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)e(t)
\end{aligned} \tag{5.59}$$

or

$$\hat{x}_{map}(t) = \hat{x}(t|t-1) + \left[C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)C[\hat{x}(t|t-1)] + P^{-1}(t|t-1)\right]^{-1}C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)e(t) \tag{5.60}$$

Recognizing the similarity of this expression to the linear case with
$C[\hat{x}(t|t-1)] \to C(t)$ and using the matrix inversion lemma, it is easy to show (see
Eqs. 5.13-5.17) that our approximate updated error covariance satisfies

$$P(t|t) = \left[C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)C[\hat{x}(t|t-1)] + P^{-1}(t|t-1)\right]^{-1} \tag{5.61}$$

and therefore Eq. 5.60 can be simplified to give

$$\hat{x}_{map}(t) = \hat{x}(t|t-1) + P(t|t)C'[\hat{x}(t|t-1)]R_{vv}^{-1}(t)e(t) \tag{5.62}$$

Now recognizing the alternate form for the gain as in Eq. 5.21 gives the desired
updated estimate as

$$\hat{x}(t|t) = \hat{x}_{map}(t) = \hat{x}(t|t-1) + K(t)e(t) \tag{5.63}$$

This completes the derivation. Consider the following discrete nonlinear example
of the previous section.

Example 5.3

Consider the discrete nonlinear process and measurement system described in the
previous example. The simulated measurement using SSPACK_PC [8] is shown in
Fig. 5.6c. The XBP is designed from the following Jacobians:

$$A[x(t-1)] = 1 - 0.05\Delta T + 0.08\Delta T\,x(t-1) \quad\text{and}\quad C[x(t)] = 2x(t) + 3x^2(t)$$

The XBP algorithm is then given by

$$\begin{aligned}
\hat{x}(t|t-1) &= (1 - 0.05\Delta T)\hat{x}(t-1|t-1) + 0.04\Delta T\,\hat{x}^2(t-1|t-1) \\
P(t|t-1) &= [1 - 0.05\Delta T + 0.08\Delta T\,\hat{x}(t-1|t-1)]^2 P(t-1|t-1) \\
e(t) &= y(t) - \hat{x}^2(t|t-1) - \hat{x}^3(t|t-1) \\
R_{ee}(t) &= [2\hat{x}(t|t-1) + 3\hat{x}^2(t|t-1)]^2 P(t|t-1) + 0.09 \\
K(t) &= P(t|t-1)[2\hat{x}(t|t-1) + 3\hat{x}^2(t|t-1)]R_{ee}^{-1}(t) \\
\hat{x}(t|t) &= \hat{x}(t|t-1) + K(t)e(t) \\
P(t|t) &= (1 - K(t)[2\hat{x}(t|t-1) + 3\hat{x}^2(t|t-1)])P(t|t-1) \\
\hat{x}(0|0) &= 2.3 \quad\text{and}\quad P(0|0) = 0.01
\end{aligned}$$
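The recursion above translates almost line for line into code. The sketch below is illustrative only and is not the SSPACK_PC implementation. It assumes a process model $x(t) = (1 - 0.05\Delta T)x(t-1) + 0.04\Delta T x^2(t-1)$ and a measurement $y(t) = x^2(t) + x^3(t) + v(t)$ with $R_{vv} = 0.09$, as read off the prediction and innovation equations. The sampling interval `DT = 0.01`, the true initial state `x0 = 2.0`, zero process noise, and the random seed are assumptions not stated in this section.

```python
import numpy as np

DT = 0.01      # sampling interval (assumed)
RVV = 0.09     # measurement noise variance, from the R_ee equation

def a(x):      # process function implied by the state prediction
    return (1 - 0.05 * DT) * x + 0.04 * DT * x**2

def A(x):      # process Jacobian
    return 1 - 0.05 * DT + 0.08 * DT * x

def c(x):      # measurement function implied by the innovation
    return x**2 + x**3

def C(x):      # measurement Jacobian
    return 2 * x + 3 * x**2

def xbp_step(x_upd, P_upd, y):
    """One XBP (EKF) cycle: prediction, innovation, gain, update."""
    x_pred = a(x_upd)
    P_pred = A(x_upd)**2 * P_upd
    e = y - c(x_pred)
    Ree = C(x_pred)**2 * P_pred + RVV
    K = P_pred * C(x_pred) / Ree
    return x_pred + K * e, (1 - K * C(x_pred)) * P_pred, e, Ree

def simulate(n, x0=2.0, seed=1):
    """Noise-free state trajectory with noisy measurements (x0 and seed assumed)."""
    rng = np.random.default_rng(seed)
    x, y, xk = np.empty(n), np.empty(n), x0
    for t in range(n):
        xk = a(xk)
        x[t] = xk
        y[t] = c(xk) + np.sqrt(RVV) * rng.standard_normal()
    return x, y

x_true, y = simulate(150)
x_hat, P = 2.3, 0.01                      # initial conditions of the example
estimates = []
for yt in y:
    x_hat, P, e, Ree = xbp_step(x_hat, P, yt)
    estimates.append(x_hat)
estimates = np.array(estimates)

print("final estimation error:", estimates[-1] - x_true[-1])
```

Storing `e` and `Ree` at every step gives the innovation sequence and its predicted variance, which are the inputs to the zero-mean and whiteness tests quoted in the example.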

An XBP run is depicted in Fig. 5.6. Here we see that the state estimate begins
tracking the true state after the initial transient. The estimation error is reasonable
(~1% of the samples lie outside the limits), indicating that the filter is performing
properly for this realization. The filtered measurement and innovations are shown in
Fig. 5.6b and lie within the predicted limits. The innovations are zero mean
($6.3 \times 10^{-2} < 11.8 \times 10^{-2}$) and white (0% lie outside the limits) as shown in
Fig. 5.6c, indicating proper tuning. Comparing the XBP to the LZ-BP of the previous
section shows that it performs slightly worse in terms of predicted covariance limits
for the estimated measurement and innovations. Most of this error is caused by the
initial conditions of the processor.

Running an ensemble of 101 realizations of this processor yields similar results
for the ensemble estimates: the state estimation error increased (~2% outside the
limits), the innovations zero-mean test statistic increased slightly
($6.7 \times 10^{-2} < 12 \times 10^{-2}$), and the whiteness result was identical. This completes
the example.

Next we consider one of the most popular applications of the XBP approach: the
tracking problem [1, 5, 15, 20]. The choice of the coordinate system for the tracker
determines whether the nonlinearities occur in the state equations (polar
coordinates [22]) or in the measurement equations (Cartesian coordinates [23]). The
following application depicts typical XBP performance in Cartesian coordinates.

FIGURE 5.6 XBP (EKF) simulation. (a) Estimated state (1% out) and error (1% out).
(b) Filtered measurement (1.3% out) and error (innovation) (3.3% out). (c) Simulated
measurement and zero-mean/whiteness test ($6.3 \times 10^{-2} < 11.8 \times 10^{-2}$ and 0% out).


Example 5.4 

Consider the following passive localization and tracking problem that frequently
arises in sonar and navigation applications [15]. A maneuvering observer $O$ monitors
noisy “bearings-only” measurements from a target $t$ assumed to be traveling at a
constant velocity. These measurements are to be used to estimate target position $r$
and velocity $v$. The problem is geometrically depicted in Fig. 5.7. The velocity and
position of the target relative to the observer are defined by

$$\begin{aligned}
v_x(t) &:= v_{tx}(t) - v_{ox}(t) & r_x(t) &:= r_{tx}(t) - r_{ox}(t) \\
v_y(t) &:= v_{ty}(t) - v_{oy}(t) & r_y(t) &:= r_{ty}(t) - r_{oy}(t)
\end{aligned}$$

FIGURE 5.7 Observer/target ground track geometry for the XBP tracking application.

The velocity is related to position by

$$v(t) = \frac{d}{dt}r(t) \approx \frac{r(t) - r(t-1)}{\Delta T} \quad\text{for } \Delta T \text{ the sampling interval}$$

or

$$r(t) = r(t-1) + \Delta T\,v(t-1)$$

and

$$\begin{aligned}
v(t) &= v(t-1) + [v(t) - v(t-1)] \\
v(t) &= [v_t(t-1) - v_o(t-1)] - [v_o(t) - v_o(t-1)] = v(t-1) - \Delta v_o(t-1)
\end{aligned}$$

for a constant velocity target, $v_t(t) = v_t(t-1) = \cdots = v_t$, where $\Delta v_o$ is the incremental
change in observer velocity. Using these relations, we can easily develop a
Gauss-Markov model of the equations of motion in two dimensions by defining the state
vector as $x' := [r_x \;\; r_y \;\; v_x \;\; v_y]$ and input $u' := [-\Delta v_{ox} \;\; -\Delta v_{oy}]$, but first let us consider
the measurement model.

For this problem we have the bearing relation given by

$$\beta(x, t) := \arctan\left[\frac{r_x(t)}{r_y(t)}\right]$$

The entire system can be represented as a Gauss-Markov model with the noise
sources representing uncertainties in the states and measurements. Thus, we have the
equations of motion given by

$$x(t) = \begin{bmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} x(t-1) + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} -\Delta v_{ox}(t-1) \\ -\Delta v_{oy}(t-1) \end{bmatrix} + w(t-1)$$

with the nonlinear sensor model given by

$$y(t) = \arctan\left[\frac{x_1(t)}{x_2(t)}\right] + v(t)$$

for $w \sim N(0, R_{ww})$ and $v \sim N(0, R_{vv})$. The SSPACK_PC software [8] was used to
simulate this system, which is depicted in Fig. 5.8 for two legs of a tracking scenario.
An impulse-incremental step change ($\Delta v_{ox} = -24$ knots and $\Delta v_{oy} = +10$ knots) was
initiated at 0.5 h, resulting in a change of observer position and velocity depicted in
the figure. The simulated bearing measurements are shown in Fig. 5.8d. The initial
conditions for the run were $x'(0) := [0 \;\; 15\,\text{nm} \;\; 20\,\text{k} \;\; -10\,\text{k}]$ and $R_{ww} = \text{diag}\,10^{-6}$, with
the measurement noise covariance given by $R_{vv} = 3.05 \times 10^{-4}\ \text{rad}^2$ for $\Delta T = 0.33$ h.

The XBP algorithm of Table 5.3 is implemented using this model and the following
Jacobian matrices derived from the Gauss-Markov model above:

$$A[x] = A \quad\text{and}\quad C[x] = \begin{bmatrix} \dfrac{x_2(t)}{R^2} & \dfrac{-x_1(t)}{R^2} & 0 & 0 \end{bmatrix}$$

where

$$R = \sqrt{x_1^2(t) + x_2^2(t)}$$
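A short sketch of this model may help fix the notation. It builds the transition and input matrices of the Gauss-Markov model, propagates the stated initial condition, and compares the analytic measurement Jacobian with a finite-difference approximation. The bearing is taken as $\arctan(x_1/x_2)$, which is the form implied by the Jacobian above; the function names and the use of `arctan2` (which agrees with $\arctan(x_1/x_2)$ for $x_2 > 0$) are assumptions made for illustration.

```python
import numpy as np

DT = 0.33  # sampling interval in hours, from the example

A = np.array([[1.0, 0.0, DT, 0.0],
              [0.0, 1.0, 0.0, DT],
              [0.0, 0.0, 1.0, 0.0],
              [0.0, 0.0, 0.0, 1.0]])
B = np.array([[0.0, 0.0],
              [0.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0]])

def bearing(x):
    """Bearing measurement function c[x] = arctan(x1 / x2)."""
    return np.arctan2(x[0], x[1])

def bearing_jacobian(x):
    """Analytic Jacobian C[x] = [x2/R^2, -x1/R^2, 0, 0]."""
    R2 = x[0]**2 + x[1]**2
    return np.array([x[1] / R2, -x[0] / R2, 0.0, 0.0])

def finite_difference_jacobian(f, x, h=1e-6):
    """Central-difference gradient of a scalar function, used only as a check."""
    grad = np.zeros(x.size)
    for k in range(x.size):
        dx = np.zeros(x.size)
        dx[k] = h
        grad[k] = (f(x + dx) - f(x - dx)) / (2 * h)
    return grad

x0 = np.array([0.0, 15.0, 20.0, -10.0])      # [r_x (nm), r_y (nm), v_x (k), v_y (k)]
x1 = A @ x0 + B @ np.zeros(2)                # one step with no observer maneuver
u_maneuver = np.array([24.0, -10.0])         # u = [-dv_ox, -dv_oy] for dv_ox = -24, dv_oy = +10
x2 = A @ x1 + B @ u_maneuver                 # one step through the maneuver

print(x1, x2)
print(bearing_jacobian(x1), finite_difference_jacobian(bearing, x1))
```

Because only the two position states enter the bearing, the last two Jacobian entries are zero; the velocity states are observable only through the dynamics, which is why the observer must maneuver.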

The results of this run are shown in Fig. 5.8. In panels (a) and (b) we see the
respective x and y position estimates (velocity estimates are not shown) and the
corresponding tracking errors. Here we see that it takes approximately 1.8 h for the
estimator to converge to the true target position (within ±1 nm). The innovations
sequence appears statistically zero-mean and white in Fig. 5.8c and d, indicating
satisfactory performance. The filtered measurement in (c) is improved considerably
over the unprocessed data in (d). This completes the example.

This completes the section on the extension of the BP to nonlinear problems. Next 
we investigate variants of the XBP for improved performance. 


5.6 ITERATED-EXTENDED BAYESIAN PROCESSOR 
(ITERATED-EXTENDED KALMAN FILTER) 

In this section we discuss an extension of the XBP of the previous section to the
iterated-extended (IX-BP). We heuristically motivate the design and then discuss a
more detailed approach using Bayesian MAP estimation coupled with numerical
optimization techniques to develop the processor. This algorithm is based on performing
“local” iterations (not global) at a point in time, $t$, to improve the reference
trajectory and therefore the underlying estimate in the presence of significant
measurement nonlinearities [1]. A local iteration implies that the inherent recursive
structure of the processor is retained, providing new estimates as the new measurements
are made available.

FIGURE 5.8 Extended BP (EKF) simulation for bearings-only tracking problem.
(a) X-position estimate and error. (b) Y-position estimate and error. (c) Filtered
measurement and error (innovation). (d) Simulated measurement and zero-mean/whiteness
test ($3.2 \times 10^{-3} < 6.1 \times 10^{-3}$ and 0% out).

To develop the iterated-extended processor, we start with the linearized processor
update relation substituting the “linearized” innovation of Eq. 5.33 of the LZ-BP,
that is,

$$\hat{x}(t|t) = \hat{x}(t|t-1) + K(t; x^*(t))\left[y(t) - c[x^*(t)] - C[x^*(t)](\hat{x}(t|t-1) - x^*(t))\right] \tag{5.64}$$

where we have explicitly shown the dependence of the gain (through the measurement
Jacobian) on the reference trajectory, $x^*(t)$. The XBP algorithm linearizes about
the most currently available estimate, $x^*(t) = \hat{x}(t|t-1)$ in this case. Theoretically,
the updated estimate, $\hat{x}(t|t)$, is a better estimate and closer to the true trajectory.
Suppose we continue and re-linearize about $\hat{x}(t|t)$ when it becomes available and then
recompute the corrected estimate and so on. That is, define the $(i+1)$th-iterated
estimate as $\hat{x}_{i+1}(t|t)$, then the updated iterator equation becomes

$$\hat{x}_{i+1}(t|t) = \hat{x}(t|t-1) + K(t; \hat{x}_i(t|t)) \times \left[y(t) - c[\hat{x}_i(t|t)] - C[\hat{x}_i(t|t)](\hat{x}(t|t-1) - \hat{x}_i(t|t))\right] \tag{5.65}$$

Now if we start with the 0th iterate as the predicted estimate, that is, $\hat{x}_0 = \hat{x}(t|t-1)$,
then the XBP results for $i = 0$. Clearly, the updated estimate in this iteration is
given by

$$\hat{x}_1(t|t) = \hat{x}(t|t-1) + K_0(t)\left[y(t) - c[\hat{x}(t|t-1)] - C[\hat{x}(t|t-1)](\hat{x}(t|t-1) - \hat{x}(t|t-1))\right] \tag{5.66}$$

where the last term in this expression is null, leaving the usual innovation. Note also
that the gain is reevaluated on each iteration, as are the measurement function and
Jacobian. The iterations continue until there is little difference in consecutive iterates.
The last iterate is taken as the updated estimate. The complete (updated) iterative loop
is given by:

$$\begin{aligned}
e_i(t) &= y(t) - c[\hat{x}_i(t|t)] \\
R_{e_ie_i}(t) &= C[\hat{x}_i(t|t)]P(t|t-1)C'[\hat{x}_i(t|t)] + R_{vv}(t) \\
K_i(t) &= P(t|t-1)C'[\hat{x}_i(t|t)]R_{e_ie_i}^{-1}(t) \\
\hat{x}_{i+1}(t|t) &= \hat{x}(t|t-1) + K_i(t)\left[e_i(t) - C[\hat{x}_i(t|t)](\hat{x}(t|t-1) - \hat{x}_i(t|t))\right] \\
P_i(t|t) &= (I - K_i(t)C[\hat{x}_i(t|t)])P(t|t-1)
\end{aligned} \tag{5.67}$$

A typical stopping rule is:

$$\|\hat{x}_{i+1}(t|t) - \hat{x}_i(t|t)\| < \epsilon \quad\text{and}\quad \hat{x}_i(t|t) \rightarrow \hat{x}(t|t) \tag{5.68}$$

TABLE 5.4 Iterated Extended BP (Kalman Filter) Algorithm

Prediction

  $\hat{x}(t|t-1) = a[\hat{x}(t-1|t-1)] + b[u(t-1)]$   (state prediction)

  $P(t|t-1) = A[\hat{x}(t|t-1)]P(t-1|t-1)A'[\hat{x}(t|t-1)] + R_{ww}(t-1)$   (covariance prediction)

LOOP: $i = 1, \ldots, N_{iterations}$

Innovation

  $e_i(t) = y(t) - c[\hat{x}_i(t|t)]$   (innovation)

  $R_{e_ie_i}(t) = C[\hat{x}_i(t|t)]P(t|t-1)C'[\hat{x}_i(t|t)] + R_{vv}(t)$   (innovation covariance)

  $K_i(t) = P(t|t-1)C'[\hat{x}_i(t|t)]R_{e_ie_i}^{-1}(t)$   (gain or weight)

Update

  $\hat{x}_{i+1}(t|t) = \hat{x}(t|t-1) + K_i(t)\left[e_i(t) - C[\hat{x}_i(t|t)](\hat{x}(t|t-1) - \hat{x}_i(t|t))\right]$   (state update)

  $P_i(t|t) = [I - K_i(t)C[\hat{x}_i(t|t)]]P(t|t-1)$   (covariance update)

Initial Conditions

  $\hat{x}(0|0), \quad P(0|0), \quad \hat{x}_0(t|t) = \hat{x}(t|t-1)$

  $A[\hat{x}(t|t-1)] = \left.\dfrac{da[x(t-1)]}{dx(t-1)}\right|_{x = \hat{x}(t|t-1)}$


The IX-BP algorithm is summarized in Table 5.4. 
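The measurement-update loop of Eq. 5.67 with the stopping rule of Eq. 5.68 can be sketched as a single function. This is an illustrative sketch, not the SSPACK_PC routine: the function name `ixbp_update`, its arguments, and the default iteration count and tolerance are all assumptions. The usage lines apply it to the scalar measurement $y = x^2 + x^3 + v$ of Example 5.3 with a hypothetical noise-free measurement of a true state of 2.0.

```python
import numpy as np

def ixbp_update(x_pred, P_pred, y, c, C_jac, Rvv, n_iter=10, tol=1e-8):
    """Iterated measurement update (Eq. 5.67) started from the predicted estimate."""
    x_pred = np.atleast_1d(np.asarray(x_pred, dtype=float))
    P_pred = np.atleast_2d(np.asarray(P_pred, dtype=float))
    y = np.atleast_1d(np.asarray(y, dtype=float))
    Rvv = np.atleast_2d(np.asarray(Rvv, dtype=float))
    I = np.eye(x_pred.size)

    x_i = x_pred.copy()                          # 0th iterate: x_hat(t|t-1)
    P_i = P_pred.copy()
    for _ in range(n_iter):
        C = np.atleast_2d(C_jac(x_i))            # Jacobian re-evaluated at the iterate
        e_i = y - np.atleast_1d(c(x_i))          # innovation
        Ree = C @ P_pred @ C.T + Rvv             # innovation covariance
        K = P_pred @ C.T @ np.linalg.inv(Ree)    # gain
        x_next = x_pred + K @ (e_i - C @ (x_pred - x_i))   # state update
        P_i = (I - K @ C) @ P_pred               # covariance update
        step = np.linalg.norm(x_next - x_i)
        x_i = x_next
        if step < tol:                           # stopping rule (Eq. 5.68)
            break
    return x_i, P_i

# Usage on the scalar measurement model of Example 5.3
c = lambda x: x**2 + x**3
C_jac = lambda x: 2 * x + 3 * x**2
y = 12.0                                         # c(2.0), a noise-free measurement (assumed)

x_xbp, _ = ixbp_update(2.3, 0.01, y, c, C_jac, 0.09, n_iter=1)   # single pass = XBP update
x_ix, P_ix = ixbp_update(2.3, 0.01, y, c, C_jac, 0.09, n_iter=20)
print(x_xbp, x_ix)
```

With `n_iter=1` the loop reduces to the XBP update, since the correction term vanishes at the 0th iterate. Further passes re-linearize the measurement about the improved estimate, which in this scalar case pulls the result closer to the state that generated the measurement.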

The IX-BP can be useful in reducing the measurement function nonlinearity
approximation errors, thereby improving processor performance. It is designed for
measurement nonlinearities and does not improve the previous reference trajectory, but it
will improve on the subsequent one. Next we take a slightly more formal approach to
developing the IX-BP from the Bayesian perspective, that is, we first formulate a
parametric optimization problem to develop the generic structure of the iterator, then
apply it to the underlying state estimation problem. Let us assume that we have a
nonlinear cost function, $J(\Theta)$, that we would like to maximize relative to the parameter
vector, $\Theta \in R^{N_\Theta \times 1}$. We begin by expanding the cost in terms of a Taylor series about
the $\Theta_i$, that is,

$$J(\Theta) = J(\Theta_i) + (\Theta - \Theta_i)'[\nabla_\Theta J(\Theta_i)] + \frac{1}{2}(\Theta - \Theta_i)'[\nabla_{\Theta\Theta} J(\Theta_i)](\Theta - \Theta_i) + \text{H.O.T.} \tag{5.69}$$

where $\nabla_\Theta$ is the $N_\Theta$-gradient vector defined by

$$\nabla_\Theta J(\Theta_i) := \left.\frac{\partial J(\Theta)}{\partial \Theta}\right|_{\Theta = \Theta_i} \tag{5.70}$$

with the corresponding $N_\Theta \times N_\Theta$ Hessian matrix defined by

$$\nabla_{\Theta\Theta} J(\Theta_i) := \left.\frac{\partial^2 J(\Theta)}{\partial \Theta^2}\right|_{\Theta = \Theta_i} \tag{5.71}$$

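Now if we approximate this expression by neglecting the H.O.T. and assume that
$\Theta_i$ is close to the true parameter vector ($\Theta_i \approx \Theta_{true}$), then differentiating Eq. 5.69
using the chain rule, we obtain

$$\nabla_\Theta J(\Theta) = 0 + \nabla_\Theta \Theta'[\nabla_\Theta J(\Theta_i)] + \frac{1}{2}\left(2[\nabla_{\Theta\Theta} J(\Theta_i)](\Theta - \Theta_i)\right) = 0$$

or

$$[\nabla_{\Theta\Theta} J(\Theta_i)](\Theta - \Theta_i) = -[\nabla_\Theta J(\Theta_i)]$$

Solving for $\Theta$ and letting $\Theta \rightarrow \Theta_{i+1}$ we obtain the well-known Newton-Raphson
iterator (NRI) as [1, 15, 16]

$$\Theta_{i+1} = \Theta_i - [\nabla_{\Theta\Theta} J(\Theta_i)]^{-1}[\nabla_\Theta J(\Theta_i)] \tag{5.72}$$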

This is the form that our IX-BP will assume. Now let us return to the basic problem 
of improved state estimation using the NRI. 
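As a concrete illustration of the iterator in Eq. 5.72, the sketch below applies it to a simple concave cost whose maximizer is known. The cost function, starting point, iteration limit and tolerance are all chosen for illustration and do not come from the text.

```python
import numpy as np

def newton_raphson(grad, hess, theta0, n_iter=50, tol=1e-10):
    """Newton-Raphson iterator: theta <- theta - H^{-1} g (Eq. 5.72)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iter):
        step = np.linalg.solve(hess(theta), grad(theta))   # H^{-1} g without forming H^{-1}
        theta = theta - step
        if np.linalg.norm(step) < tol:
            break
    return theta

# Illustrative cost J = -(t1 - 1)^4 - (t1 - 1)^2 - (t2 + 2)^2, maximized at (1, -2)
grad = lambda th: np.array([-4 * (th[0] - 1)**3 - 2 * (th[0] - 1),
                            -2 * (th[1] + 2)])
hess = lambda th: np.array([[-12 * (th[0] - 1)**2 - 2, 0.0],
                            [0.0, -2.0]])

theta_hat = newton_raphson(grad, hess, [3.0, 0.0])
print(theta_hat)
```

The iterator only locates a stationary point of the cost; whether that point is a maximum or a minimum is determined by the sign of the Hessian there, so the same recursion serves both problem statements.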

Under the usual Gaussian assumptions, we would like to calculate the MAP estimate
of the state at time $t$ based on the data up to time $t$; therefore, the a posteriori
probability (see Sec. 5.4) is given by

$$\Pr(x(t)|Y_t) = \frac{1}{\eta} \times \Pr(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1}) \tag{5.73}$$

where $\eta$ is a normalizing probability (not a function of the state) and can be ignored
in this situation. As in the linear case, we have

1. $\Pr(y(t)|x(t)) : N(c[x(t)], R_{vv}(t))$

2. $\Pr(x(t)|Y_{t-1}) : N(\hat{x}(t|t-1), P(t|t-1))$

Maximizing the a posteriori probability is equivalent to maximizing its logarithm;
therefore, dropping the terms that do not depend on $x(t)$, we have that

$$\begin{aligned}
J(x(t)) ={}& \ln\Pr(y(t)|x(t)) + \ln\Pr(x(t)|Y_{t-1}) \\
={}& -\frac{1}{2}(y(t) - c[x(t)])'R_{vv}^{-1}(t)(y(t) - c[x(t)]) \\
& - \frac{1}{2}(x(t) - \hat{x}(t|t-1))'P^{-1}(t|t-1)(x(t) - \hat{x}(t|t-1))
\end{aligned} \tag{5.74}$$

To maximize this expression, we differentiate with respect to $x(t)$ using the gradient
operator as before to obtain

$$\nabla_x J(x(t)) = [\nabla_x c'[x(t)]]R_{vv}^{-1}(t)(y(t) - c[x(t)]) - P^{-1}(t|t-1)(x(t) - \hat{x}(t|t-1)) \tag{5.75}$$

With a slight abuse, we can simplify the notation by defining the relations

$$C[x_i(t)] := \left.\frac{\partial c[x(t)]}{\partial x(t)}\right|_{x = x_i(t)}, \quad e_i(t) := y(t) - c[x_i(t)], \quad\text{and}\quad \tilde{x}_i(t|t-1) := x_i(t) - \hat{x}(t|t-1)$$

Therefore, letting $x \rightarrow x_i$, we can write Eq. 5.75 as

$$\nabla_x J(x_i(t)) = C'[x_i(t)]R_{vv}^{-1}(t)e_i(t) - P^{-1}(t|t-1)\tilde{x}_i(t|t-1) \tag{5.76}$$

which is the same form as the linear MAP estimator of Eq. 5.10 with $C[x_i(t)] \rightarrow C(t)$.

Continuing with the NRI derivation, we differentiate Eq. 5.76 again (neglecting the
derivative of the Jacobian itself) to obtain the Hessian

$$\nabla_{xx} J(x_i(t)) = -\left[C'[x_i(t)]R_{vv}^{-1}(t)C[x_i(t)] + P^{-1}(t|t-1)\right] \tag{5.77}$$

but applying the matrix inversion lemma as in Sec. 5.4 (see Eq. 5.13 for details), it is
shown that

$$-[\nabla_{xx} J(x_i(t))]^{-1} = (I - K_i(t)C[x_i(t)])P(t|t-1) =: P_i(t|t) \tag{5.78}$$

for

$$K_i(t) := K(t; x_i(t)) = P(t|t-1)C'[x_i(t)]R_{e_ie_i}^{-1}(t) \tag{5.79}$$

with

$$R_{e_ie_i}(t) = C[x_i(t)]P(t|t-1)C'[x_i(t)] + R_{vv}(t) \tag{5.80}$$

Now equating Eqs. 5.77 and 5.78 and solving for $P^{-1}(t|t-1)$, we obtain

$$P^{-1}(t|t-1) = P_i^{-1}(t|t) - C'[x_i(t)]R_{vv}^{-1}(t)C[x_i(t)] \tag{5.81}$$

which can be substituted back into Eq. 5.76 to give

$$\nabla_x J(x_i(t)) = C'[x_i(t)]R_{vv}^{-1}(t)e_i(t) - \left[P_i^{-1}(t|t) - C'[x_i(t)]R_{vv}^{-1}(t)C[x_i(t)]\right]\tilde{x}_i(t|t-1) \tag{5.82}$$

The NRI can now be written in terms of the iterated state at time $t$ as

$$x_{i+1}(t) = x_i(t) - [\nabla_{xx} J(x_i(t))]^{-1}\nabla_x J(x_i(t)) \tag{5.83}$$

Using the expressions for the Hessian and gradient, we have

$$x_{i+1}(t) = x_i(t) + P_i(t|t) \times \left(C'[x_i(t)]R_{vv}^{-1}(t)e_i(t) - \left[P_i^{-1}(t|t) - C'[x_i(t)]R_{vv}^{-1}(t)C[x_i(t)]\right]\tilde{x}_i(t|t-1)\right) \tag{5.84}$$

Multiplying through by $P_i(t|t)$ and recognizing the alternate expression for
the gain, $K_i(t) = P_i(t|t)C'[x_i(t)]R_{vv}^{-1}(t)$ (see Eq. 5.22), we obtain the expression

$$x_{i+1}(t) = x_i(t) + K_i(t)e_i(t) - (I - K_i(t)C[x_i(t)])\tilde{x}_i(t|t-1) \tag{5.85}$$

Now substituting $\tilde{x}_i(t|t-1) = x_i(t) - \hat{x}(t|t-1)$, factoring out the gain and performing
the additions, we have

$$x_{i+1}(t) = \hat{x}(t|t-1) + K_i(t)\left[e_i(t) - C[x_i(t)](\hat{x}(t|t-1) - x_i(t))\right] \tag{5.86}$$

Defining the iterate in terms of the corrected state estimate, that is, $x_i(t) \rightarrow \hat{x}_i(t|t)$,
gives the NRI iterator of Eq. 5.67 and Table 5.4.
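The equivalence between the Newton-Raphson step and the gain form of the iterator can be confirmed with a few lines of arithmetic. The sketch below works in the scalar case; the numerical values (a predicted estimate of 2.3, a current iterate of 2.04, a measurement of 12.0 and the measurement function $x^2 + x^3$) are illustrative assumptions borrowed from the running example.

```python
# Scalar check that the Newton-Raphson step (Eq. 5.84) equals the gain form (Eq. 5.86)
x_pred, P_pred, Rvv = 2.3, 0.01, 0.09      # x_hat(t|t-1), P(t|t-1), R_vv
x_i, y = 2.04, 12.0                        # current iterate and measurement (assumed)

C = 2 * x_i + 3 * x_i**2                   # Jacobian at the iterate
e_i = y - (x_i**2 + x_i**3)                # innovation at the iterate
x_tilde = x_i - x_pred                     # x_tilde_i(t|t-1)

# Newton-Raphson form: P_i(t|t) is the inverse of (C' R^-1 C + P^-1)
P_i = 1.0 / (C * C / Rvv + 1.0 / P_pred)
x_newton = x_i + P_i * (C * e_i / Rvv - (1.0 / P_i - C * C / Rvv) * x_tilde)

# Gain form of the IX-BP update
K = P_pred * C / (C * C * P_pred + Rvv)
x_gain = x_pred + K * (e_i - C * (x_pred - x_i))

print(x_newton, x_gain)
```

Both expressions are evaluated from the same iterate, so any difference between the two printed values is floating-point rounding only.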

So we see that for strong measurement nonlinearities the IX-BP can be used at little
additional cost over the XBP. A further extension of these results is called the
iterator-smoother XBP, in which the entire processor is iterated to mitigate strong
nonlinearities in the predicted estimates [1]. Here the measurement is relinearized and
then a “smoothed” state estimate is calculated and used in the prediction loop. This
completes the discussion of the processor; next we demonstrate its performance.

Example 5.5 

Consider the discrete nonlinear process and measurement system described in
Example 5.3. The simulated measurement using SSPACK_PC [8] is shown in
Fig. 5.9c. The IX-BP is designed from the following Jacobians:

$$A[x(t-1)] = 1 - 0.05\Delta T + 0.08\Delta T\,x(t-1) \quad\text{and}\quad C[x(t)] = 2x(t) + 3x^2(t)$$

The IX-BP algorithm is then given by

$$\begin{aligned}
\hat{x}(t|t-1) &= (1 - 0.05\Delta T)\hat{x}(t-1|t-1) + 0.04\Delta T\,\hat{x}^2(t-1|t-1) \\
P(t|t-1) &= [1 - 0.05\Delta T + 0.08\Delta T\,\hat{x}(t-1|t-1)]^2 P(t-1|t-1) \\
e_i(t) &= y(t) - \hat{x}_i^2(t|t) - \hat{x}_i^3(t|t) \\
R_{e_ie_i}(t) &= [2\hat{x}_i(t|t) + 3\hat{x}_i^2(t|t)]^2 P(t|t-1) + 0.09 \\
K_i(t) &= P(t|t-1)[2\hat{x}_i(t|t) + 3\hat{x}_i^2(t|t)] / R_{e_ie_i}(t) \\
\hat{x}_{i+1}(t|t) &= \hat{x}(t|t-1) + K_i(t)\left[e_i(t) - [2\hat{x}_i(t|t) + 3\hat{x}_i^2(t|t)](\hat{x}(t|t-1) - \hat{x}_i(t|t))\right] \\
P_i(t|t) &= (1 - K_i(t)[2\hat{x}_i(t|t) + 3\hat{x}_i^2(t|t)])P(t|t-1) \\
\hat{x}(0|0) &= 2.3 \quad\text{and}\quad P(0|0) = 0.01
\end{aligned}$$

An IX-BP run is depicted in Fig. 5.9. Here we see that the state estimate (~0% lie
outside the limits) begins tracking the true state instantaneously. The estimation error
is reasonable (~0% out), indicating that the filter is performing properly for this
realization. The filtered measurement (~1% out) and innovations (~2.6% out) are shown in
Fig. 5.9b. The innovations are zero mean ($4 \times 10^{-2} < 10.7 \times 10^{-2}$) and white (0%
lie outside the limits) as shown in Fig. 5.9c, indicating proper tuning and matching the
LZ-BP result almost exactly.

Running an ensemble of 101 realizations of this processor yields similar
results for the ensemble estimates: the innovations zero-mean test statistic decreased
slightly ($3.7 \times 10^{-2} < 10.3 \times 10^{-2}$) and the whiteness result was identical. So the
overall effect of the IX-BP was to decrease the measurement nonlinearity effect,
especially in the initial transient of the algorithm. These results are almost identical
to those of the LZ-BP. This completes the nonlinear filtering example.

FIGURE 5.9 Iterated-extended BP (IEKF) simulation. (a) Estimated state (~0% out) and
error (~0% out). (b) Filtered measurement (~1% out) and error (innovation) (~2.6%
out). (c) Simulated measurement and zero-mean/whiteness test ($4 \times 10^{-2} < 10.7 \times 10^{-2}$
and 0% out).

Next we consider some more practical approaches to designing estimators.


5.7 PRACTICAL ASPECTS OF CLASSICAL BAYESIAN PROCESSORS

In this section we heuristically provide an intuitive feel for the operation of the
Bayesian processor using the state-space model and GM assumptions. These results,
coupled with the theoretical points developed in [10], lead to the proper adjustment
or “tuning” of the BP. Tuning the processor is considered an art, but with proper
statistical tests, the performance can readily be evaluated and adjusted. As mentioned
previously, this approach is called the minimum (error) variance design.
In contrast to standard filter design procedures in signal processing, the minimum
variance design adjusts the statistical parameters (e.g., covariances) of the processor
and examines the innovations sequence to determine if the BP is properly
tuned. Once tuned, all of the statistics (conditional means and variances) are
valid and may be used as reasonable estimates. Here we discuss how the parameters
can be adjusted and what statistical tests must be performed to evaluate BP
performance.

Heuristically, the sequential BP can be viewed simply by its update equation

$$\hat{X}_{new} = \underbrace{\hat{X}_{old}}_{\text{State-space model}} + K \times \underbrace{E_{new}}_{\text{Measurement}}$$

where $\hat{X}_{old} \approx f(\text{model})$ and $E_{new} \approx f(\text{measurement})$.

Using this model of the BP, we see that we can view the old, or predicted, estimate
$\hat{X}_{old}$ as a function of the state-space model $(A, B)$ and the prediction error or innovation
$E$ as a function primarily of the new measurement, as indicated in Table 5.1. Consider
the new estimate under the following cases:

  $K \rightarrow$ small:  $\hat{X}_{new} = \hat{X}_{old} = f(\text{model})$

  $K \rightarrow$ large:  $\hat{X}_{new} = K E_{new} = f(\text{measurement})$

So we can see that the operation of the processor is pivoted about the values of the
gain or weighting matrix $K$. For small $K$, the processor “believes” the model, and for
large $K$, the processor believes the measurement (Fig. 5.10).

Let us investigate the gain matrix and see if its variations are consistent with these
heuristic notions. First, it was shown in Eq. 5.22 that the alternate form of the gain
equation is given by

$$K(t) = P(t|t)C'(t)R_{vv}^{-1}(t)$$

Thus, the condition where $K$ is small can occur in two cases: (1) $P$ is small (fixed
$R_{vv}$), which is consistent because small $P$ implies that the model is adequate; and
(2) $R_{vv}$ is large ($P$ fixed), which is also consistent because large $R_{vv}$ implies that the
measurement is noisy, so again believe the model.

For the condition where $K$ is large, two cases can also occur: (1) $K$ is large when $P$ is
large (fixed $R_{vv}$), implying that the model is inadequate, so believe the measurement;
and (2) $R_{vv}$ is small ($P$ fixed), implying the measurement is good (high SNR). So we
see that our heuristic notions are based on specific theoretical relationships between
the parameters in the BP algorithm of Table 5.1.

FIGURE 5.10 Bayesian processor heuristic notions.

  Condition             Gain    Parameter
  Believe model         Small   $P$ small (model adequate)
                                $R_{vv}$ large (measurement noisy)
  Believe measurement   Large   $R_{vv}$ small (measurement good)
                                $P$ large (model inadequate)
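These four cases are easy to reproduce for a scalar state and measurement. The sketch below is illustrative: the function names and the numerical values are assumptions, and the gain is written in terms of the predicted covariance, $K = P(t|t-1)C/(C^2P(t|t-1) + R_{vv})$, with a second helper confirming that the alternate form $K = P(t|t)C/R_{vv}$ gives the same number.

```python
def scalar_gain(P_pred, Rvv, C=1.0):
    """K = P(t|t-1) C / (C^2 P(t|t-1) + R_vv) for a scalar state and measurement."""
    return P_pred * C / (C * C * P_pred + Rvv)

def alternate_gain(P_pred, Rvv, C=1.0):
    """The same gain through the alternate form K = P(t|t) C / R_vv."""
    K = scalar_gain(P_pred, Rvv, C)
    P_upd = (1.0 - K * C) * P_pred
    return P_upd * C / Rvv

cases = [("P small   (believe model)", 1e-4, 1.0),
         ("Rvv large (believe model)", 1.0, 1e2),
         ("P large   (believe measurement)", 1e2, 1.0),
         ("Rvv small (believe measurement)", 1.0, 1e-4)]

for label, P_pred, Rvv in cases:
    print(f"{label:34s} K = {scalar_gain(P_pred, Rvv):.4f}")
```

With $C = 1$ the gain lies between 0 and 1, so it can be read directly as the fraction of the innovation that is passed into the updated estimate.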

Summarizing, a BP (Kalman filter) is not functioning properly when the gain 
becomes small and the measurements still contain information necessary for the 
estimates. The filter is said to diverge under these conditions. In this case, it is 
necessary to detect how the filter is functioning and how to adjust it if necessary, but 
first we consider the tuned BP. 

When the processor is “tuned”, it provides an optimal or minimum (error) variance
estimate of the state. The innovations sequence, which was instrumental in deriving
the processor, also provides the starting point to check the BP operation. A necessary
and sufficient condition for a BP to be optimal is that the innovation sequence is
zero-mean and white (see [4] for the proof). These are the first properties that must
be evaluated to ensure that the processor is operating properly. If we assume that the
innovation sequence is ergodic and Gaussian, then we can use the sample mean as
the test statistic to estimate $m_e$, the population mean. The sample mean for the $i$th
component of $e_i$ is given by

$$\hat{m}_e(i) = \frac{1}{N}\sum_{t=1}^{N} e_i(t) \quad\text{for } i = 1, \ldots, N_y \tag{5.87}$$

where $\hat{m}_e(i) \sim N(m_e, R_{ee}(i)/N)$ and $N$ is the number of data samples. We perform a
statistical hypothesis test to “decide” if the innovation mean is zero [10]. We test that
the mean of the $i$th component of the innovation vector $e_i(t)$ is

$$\begin{aligned}
H_0 &: m_e(i) = 0 \\
H_1 &: m_e(i) \neq 0
\end{aligned}$$


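As our test statistic we use the sample mean. At the $\alpha$-significance level, the probability
of rejecting the null hypothesis $H_0$ is given by

$$\Pr\left(\left|\frac{\hat{m}_e(i) - m_e(i)}{\sqrt{R_{ee}(i)/N}}\right| > \frac{\tau_i - m_e(i)}{\sqrt{R_{ee}(i)/N}}\right) = \alpha \tag{5.88}$$

Therefore, the zero-mean test [10] on each component innovation $e_i$ is given by

$$\hat{m}_e(i) \underset{H_0}{\overset{H_1}{\gtrless}} \tau_i \tag{5.89}$$

Under the null hypothesis $H_0$, each $m_e(i)$ is zero. Therefore, at the 5% significance
level ($\alpha = 0.05$), we have that the threshold is

$$\tau_i = 1.96\sqrt{\frac{\hat{R}_{ee}(i)}{N}} \tag{5.90}$$

where $\hat{R}_{ee}(i)$ is the sample variance (assuming ergodicity) estimated by

$$\hat{R}_{ee}(i) = \frac{1}{N}\sum_{t=1}^{N} e_i^2(t) \tag{5.91}$$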
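A minimal sketch of the zero-mean test for a single innovation component follows, assuming the sample mean of Eq. 5.87, the sample variance of Eq. 5.91 and the 5% threshold of Eq. 5.90. The function name and the two synthetic sequences are illustrative; the comparison uses the magnitude of the sample mean so that negative biases are caught as well.

```python
import numpy as np

def zero_mean_test(e):
    """Return (passes, sample mean, threshold) for one innovation component."""
    e = np.asarray(e, dtype=float)
    N = e.size
    m_hat = e.mean()                        # sample mean, Eq. 5.87
    R_hat = np.mean(e**2)                   # sample variance, Eq. 5.91
    tau = 1.96 * np.sqrt(R_hat / N)         # 5% threshold, Eq. 5.90
    return abs(m_hat) < tau, m_hat, tau

e_ok = np.tile([1.0, -1.0], 50)             # synthetic zero-mean sequence, N = 100
e_biased = e_ok + 0.5                       # same sequence with a bias of 0.5

print(zero_mean_test(e_ok))
print(zero_mean_test(e_biased))
```

This is the comparison reported in the figure captions of this chapter in the form "sample mean < threshold".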


Under the same assumptions, we can perform a whiteness test [10], that is, check
statistically that the innovations covariance corresponds to that of an uncorrelated
(white) sequence. Again assuming ergodicity of the innovations sequence, we use
the sample covariance function as our test statistic with the $i$th component covariance
given by

$$\hat{R}_{ee}(i, k) = \frac{1}{N}\sum_{t=k+1}^{N}(e_i(t) - \hat{m}_e(i))(e_i(t+k) - \hat{m}_e(i)) \tag{5.92}$$

We actually use the normalized covariance test statistic

$$\rho_{ee}(i, k) = \frac{\hat{R}_{ee}(i, k)}{\hat{R}_{ee}(i)} \tag{5.93}$$

Asymptotically for large $N$, it can be shown (see [10-14]) that

$$\rho_{ee}(i, k) \sim N(0, 1/N)$$

Therefore, the 95% confidence interval estimate is

$$I_{\rho_{ee}} = \rho_{ee}(i, k) \pm \frac{1.96}{\sqrt{N}} \quad\text{for } N > 30 \tag{5.94}$$

Hence, under the null hypothesis, 95% of the $\rho_{ee}(i, k)$ values must lie within this
confidence interval for each component innovation sequence to be considered
statistically white. Similar tests can be constructed for the cross-covariance properties
of the innovations [13] as well, that is,

$$\text{Cov}(e(t), e(k)) = 0 \quad\text{and}\quad \text{Cov}(e(t), u(t-1)) = 0$$
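The whiteness test can be sketched for one innovation component as follows. Several details are assumptions made for illustration: the lagged products are summed only over index pairs that lie inside the record, the normalization uses the mean-removed lag-zero covariance, the number of lags defaults to 50, and the white and first-order autoregressive test sequences are synthetic.

```python
import numpy as np

def whiteness_test(e, max_lag=50):
    """Fraction of normalized sample covariances outside +/- 1.96/sqrt(N)."""
    e = np.asarray(e, dtype=float)
    N = e.size
    d = e - e.mean()                                     # remove the sample mean
    R0 = np.sum(d * d) / N                               # lag-zero sample covariance
    rho = np.array([np.sum(d[:N - k] * d[k:]) / (N * R0)
                    for k in range(1, max_lag + 1)])     # normalized covariance per lag
    bound = 1.96 / np.sqrt(N)                            # 95% limit, Eq. 5.94
    frac_out = float(np.mean(np.abs(rho) > bound))
    return frac_out, rho, bound

rng = np.random.default_rng(0)
white = rng.standard_normal(1000)                        # uncorrelated sequence
colored = np.zeros(1000)
for t in range(1, 1000):                                 # correlated AR(1) sequence
    colored[t] = 0.9 * colored[t - 1] + white[t]

frac_white, _, bound = whiteness_test(white)
frac_colored, _, _ = whiteness_test(colored)
print(frac_white, frac_colored, bound)
```

A sequence is declared white when no more than about 5% of the lags fall outside the bound; the correlated sequence fails because its sample covariance decays slowly with lag.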

The whiteness test of Eq. 5.94 is very useful for detecting model inaccuracies from
individual component innovations. However, for complex systems with a large number
of measurement channels, it becomes computationally burdensome to investigate
each innovation component-wise. A statistic capturing all of the innovation
information is the weighted sum-squared residual (WSSR) [14]. It aggregates all of
the innovation vector information over some finite window of length $N$. It can be
shown that the WSSR is related to a maximum-likelihood estimate of the normalized
innovations variance [10, 14]. The WSSR test statistic is given by

$$\rho(\ell) := \sum_{k=\ell-N+1}^{\ell} e'(k)R_{ee}^{-1}(k)e(k) \quad\text{for } \ell \geq N \tag{5.95}$$

and is based on the hypothesis test

$$\begin{aligned}
H_0 &: \rho(\ell) \text{ is white} \\
H_1 &: \rho(\ell) \text{ is not white}
\end{aligned}$$

given by

$$\rho(\ell) \underset{H_0}{\overset{H_1}{\gtrless}} \tau \tag{5.96}$$

Under the null hypothesis, the WSSR is chi-squared distributed, $\rho(\ell) \sim \chi^2(N_yN)$.
However, for $N_yN > 30$, $\rho(\ell)$ is approximately Gaussian $N(N_yN, 2N_yN)$ (see [4]
for more details). At the $\alpha$-significance level, the probability of rejecting the null
hypothesis is given by

$$\Pr\left(\left|\frac{\rho(\ell) - N_yN}{\sqrt{2N_yN}}\right| > \frac{\tau - N_yN}{\sqrt{2N_yN}}\right) = \alpha \tag{5.97}$$

For a level of significance of $\alpha = 0.05$, we have

$$\tau = N_yN + 1.96\sqrt{2N_yN} \tag{5.98}$$

Thus, the WSSR can be considered a “whiteness test” of the innovations vector
over a finite window of length $N$. Note that since $[\{e(t)\}, \{R_{ee}(t)\}]$ are obtained from
the state-space BP algorithm directly, they can be used for both stationary as well
as nonstationary processes. In fact, in practice for a large number of measurement
components, the WSSR is used to “tune” the filter and then the component innovations
are individually analyzed to detect model mismatches. Note also that the adjustable
parameter of the WSSR statistic is the window length $N$, which essentially controls
the width of the window sliding through the innovations sequence.

FIGURE 5.11 State-space BP tuning tests.

  Data         Property           Statistic                 Test                                          Assumptions
  Innovation   $m_e = 0$          Sample mean               Zero mean                                     Ergodic, Gaussian
               $R_{ee}(t)$        Sample covariance         Whiteness                                     Ergodic, Gaussian
               $\rho(\ell)$       WSSR                      Whiteness                                     Gaussian
               $R_{ee}(t,k)$      Sample cross-covariance   Cross-covariance                              Ergodic, Gaussian
               $R_{eu}(t,k)$      Sample cross-covariance   Cross-covariance                              Ergodic, Gaussian
  Covariances  Innovation         Sample variance           $\hat{R}_{ee} \approx R_{ee}$                 Ergodic
               Innovation         $R_{ee}$                  Confidence interval about $\{e(t)\}$
               Estimation error   Sample variance           $\hat{P} \approx P$                           Ergodic, $X_{true}$ known
               Estimation error   $P$                       Confidence interval about $\{\tilde{x}(t|t)\}$   $X_{true}$ known
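The WSSR statistic of Eq. 5.95 and its threshold of Eq. 5.98 can be sketched as a sliding-window sum. The function names, the array layout (one innovation vector and one innovation covariance matrix per time step) and the small constant test sequence are assumptions made for illustration.

```python
import numpy as np

def wssr(e, Ree, N):
    """Sliding-window WSSR; e has shape (T, Ny), Ree has shape (T, Ny, Ny)."""
    e = np.asarray(e, dtype=float)
    Ree_inv = np.linalg.inv(np.asarray(Ree, dtype=float))   # inverts each Ree(k)
    q = np.einsum("ti,tij,tj->t", e, Ree_inv, e)            # e'(k) Ree^{-1}(k) e(k)
    csum = np.cumsum(q)
    # Window sum over k = l-N+1, ..., l for every l with a full window
    return csum[N - 1:] - np.concatenate(([0.0], csum[:-N]))

def wssr_threshold(Ny, N):
    """Threshold tau = Ny*N + 1.96*sqrt(2*Ny*N) at the 5% significance level."""
    return Ny * N + 1.96 * np.sqrt(2 * Ny * N)

# Small deterministic illustration: Ny = 2 channels, T = 5 samples, window N = 3
e = np.ones((5, 2))
Ree = np.tile(2.0 * np.eye(2), (5, 1, 1))
rho = wssr(e, Ree, N=3)
tau = wssr_threshold(Ny=2, N=3)
print(rho, tau)
```

A longer window smooths the statistic but delays the detection of a change in the innovations, which is the trade-off behind treating $N$ as the adjustable parameter.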

Other sets of “reasonableness” tests can be performed using the covariances
estimated by the BP algorithm and sample variances estimated using Eq. 5.92. The BP
provides estimates of the respective processor covariances $R_{ee}$ and $P$ from the relations
given in Table 5.1. Using sample variance estimators when the filter reaches steady
state (process is stationary), that is, when $P$ is constant, the estimates can be compared
to ensure that they are reasonable. Thus we have

$$\hat{R}_{ee}(i) \approx R_{ee}(i) \quad\text{and}\quad \hat{P}(i) \approx P(i) \tag{5.99}$$

Plotting the $\pm 1.96\sqrt{R_{e_ie_i}(t)}$ and $\pm 1.96\sqrt{P_i(t|t)}$ bounds about the component innovations
$\{e_i(t)\}$ and component state estimation errors $\{\tilde{x}_i(t|t)\}$, when the true state is known,
provides an accurate estimate of the BP performance, especially when simulation is
used. If the covariance estimates of the processor are reasonable, then 95% of the
sequence samples should lie within the constructed bounds. Violation of these bounds
clearly indicates inadequacies in modeling the processor statistics. We summarize these
results in Fig. 5.11 and examine an RLC-circuit design problem in the following
section to demonstrate the approach in more detail.


5.8 CASE STUDY: RLC CIRCUIT PROBLEM 

Consider the design of an estimator for a series RLC circuit (second-order system)
excited by a pulse train. The circuit diagram is shown in Fig. 5.12. Using Kirchhoff’s
voltage law, we can obtain the circuit equations with $i = C(de/dt)$:

$$\frac{d^2e}{dt^2} + \frac{R}{L}\frac{de}{dt} + \frac{1}{LC}e = \frac{1}{LC}e_{in}$$

where $e_{in}$ is a unit pulse train. This equation is that of a second-order system that
characterizes the electrical RLC circuit, or a mechanical vibration system, or a
hydraulic flow system, etc. The dynamic equations can be placed in state-space
form by choosing $x := [e \mid de/dt]'$ and $u = e_{in}$:



where $w \sim N(0, R_{ww})$ is used to model component inaccuracies.

A high-impedance voltmeter is placed in the circuit to measure the capacitor
voltage $e$. We assume that it is a digital (sampled) device contaminated with noise of
variance $R_{vv}$; that is,

$$y(t) = e(t) + v(t)$$

where $v \sim N(0, R_{vv})$. For our problem we have the following parameters:
$R = 5\ \text{k}\Omega$, $L = 2.5$ H, $C = 0.1\ \mu\text{F}$, and $\tau = 0.1$ ms (the problem will be scaled in
milliseconds). We assume that the component inaccuracies can be modeled using
$R_{ww} = 0.01$, characterizing a deviation of $\pm 0.1$ V uncertainty in the circuit
representation. Finally, we assume that the precision of the voltmeter measurements is
($e \pm 0.2$ V), the two standard deviation value, so that $R_{vv} = 0.01\ (\text{V})^2$. Summarizing
the circuit model, we have the continuous-time representation

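$$\frac{d}{dt}x = \begin{bmatrix} 0 & 1 \\ -4 & -2 \end{bmatrix}x + \begin{bmatrix} 0 \\ -4 \end{bmatrix}u + \begin{bmatrix} 0 \\ -4 \end{bmatrix}w$$

and the discrete-time measurements

$$y(t) = [1 \mid 0]\,x(t) + v(t)$$

where

$$R_{ww} = 0.01\ (\text{V})^2 \quad\text{and}\quad R_{vv} = 0.01\ (\text{V})^2$$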

Before we design the discrete Bayesian processor, we must convert the system or
process model to a sampled-data (discrete) representation. Using SSPACK_PC [8],
this is accomplished automatically with the Taylor series approach to approximating
the matrix exponential. For an error tolerance of $1 \times 10^{-12}$, a 15-term series expansion
yields the following discrete-time Gauss-Markov model:

$$x(t) = \begin{bmatrix} 0.98 & 0.09 \\ -0.36 & 0.801 \end{bmatrix}x(t-1) + \begin{bmatrix} -0.019 \\ -0.36 \end{bmatrix}u(t-1) + \begin{bmatrix} -0.019 \\ -0.36 \end{bmatrix}w(t-1)$$

$$y(t) = [1 \mid 0]\,x(t) + v(t)$$

where

$$R_{ww} = 0.01\ (\text{V})^2 \quad\text{and}\quad R_{vv} = 0.01\ (\text{V})^2$$
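The discretization step can be reproduced with a short Taylor-series routine. The sketch below is not the SSPACK_PC code. It assumes time scaled in milliseconds, so that $1/LC = 4$ and $R/L = 2$ for the stated component values, and it takes the continuous-time input vector as $[0, -1/LC]'$, which is the sign that reproduces the printed discrete model. The function name `expm_taylor` and the formula used for the discrete input vector are likewise choices made for illustration.

```python
import numpy as np

def expm_taylor(M, n_terms=15):
    """Matrix exponential by a truncated Taylor series with n_terms terms."""
    result = np.eye(M.shape[0])
    term = np.eye(M.shape[0])
    for k in range(1, n_terms):
        term = term @ M / k              # M^k / k!
        result = result + term
    return result

R, L, Cap = 5e3, 2.5, 0.1e-6             # ohms, henries, farads
tau = 0.1                                # sampling interval in ms

a0 = 1.0 / (L * Cap) * 1e-6              # 1/LC rescaled to ms^-2
a1 = R / L * 1e-3                        # R/L rescaled to ms^-1

Ac = np.array([[0.0, 1.0],
               [-a0, -a1]])
Bc = np.array([0.0, -a0])                # sign assumed to match the printed discrete model

Phi = expm_taylor(Ac * tau)                            # discrete transition matrix
Gamma = np.linalg.solve(Ac, (Phi - np.eye(2)) @ Bc)    # integral of exp(Ac s) ds times Bc

print(np.round(Phi, 3))
print(np.round(Gamma, 3))
```

The product $A_c\tau$ is small here, so the series converges after only a few terms; the 15-term expansion quoted above is far more than this example needs.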

Using SSPACK_PC with initial conditions $x(0) = 0$ and $P = \text{diag}(0.01, 0.04)$, the
simulated system is depicted in Fig. 5.13. In Fig. 5.13a-c we see the simulated
states and measurements with corresponding confidence limits about the mean (true)
values. In each case, the simulation satisfies the statistical properties of the GM model.
The corresponding true (mean) trajectories are also shown along with the pulse train
excitation. Note that the measurements are merely a noisier (process and measurement
noise) version of the voltage $x_1$.

A discrete Bayesian processor was designed using SSPACK_PC to improve the
estimated voltage $x_1$. The results are shown in Fig. 5.14. In panels (a) through (c) we
see the filtered states and measurements as well as the corresponding estimation errors.
The true (mean) states are superimposed as well to indicate the tracking capability of
the estimator. The estimation errors lie within the bounds (3 percent out) for the second
state, but the error covariance is slightly underestimated for the first state (14 percent
out). The predicted and sample variances are close ($0.002 \approx 0.004$ and $0.028 \approx 0.015$)
in both cases. The innovations lie within the bounds (3 percent out) with the predicted
and sample variances close ($0.011 \approx 0.013$). The innovations are statistically zero-mean
($0.0046 < 0.014$) and white (5 percent out, WSSR below threshold),⁴ indicating a
well-tuned estimator. This completes the RLC problem.

⁴ WSSR is the weighted-sum squared residual statistic which aggregates the innovation
vector information over a window to perform a vector-type whiteness test (see [2] for
details).

FIGURE 5.13 Gauss-Markov simulation of RLC circuit problem. (a) Simulated and true
state (voltage). (b) Simulated and true state (current). (c) Simulated and true measurement.
(d) Pulse-train excitation.

FIGURE 5.14 Bayesian processor design for RLC circuit problem. (a) Estimated state
(voltage) and error. (b) Estimated state (current) and error. (c) Filtered and true
measurement (voltage) and error (innovation). (d) WSSR and whiteness test.


5.9 SUMMARY

In this chapter we have introduced the concepts of classical linear and nonlinear
Bayesian signal processing using state-space models. After developing the idea of
linearizing a nonlinear state-space system, we developed the linearized Bayesian
processor (LZ-BP). It was shown that the resulting processor provides an approximate
(time-varying) solution to the nonlinear state estimation problem. We then developed
the extended Bayesian processor (XBP), or equivalently the extended Kalman
filter (EKF), as a special case of the LZ-BP linearizing about the most currently
available estimate. Next we investigated a further enhancement of the XBP by
introducing a local iteration of the nonlinear measurement system using a Newton-Raphson
method. Here the processor is called the iterated-extended Bayesian processor (IX-BP)
and is shown to produce improved estimates at a small computational cost in
most cases. Examples were developed throughout to demonstrate the concepts (see
http://www.techni-soft.net for more details).


MATLAB NOTES 

SSPACK_PC [8] is a third-party toolbox in MATLAB that can be used to
design classical Bayesian processors. This package incorporates the major
nonlinear BP algorithms discussed in this chapter, all implemented in the
UD-factorized form [21] for stable and efficient calculations. It performs the discrete
approximate Gauss-Markov simulations using (SSNSIM) and both the extended
(XBP) and iterated-extended (IX-BP) processors using (SSNEST). The linearized
Bayesian processor (LZ-BP) is also implemented (SSLZEST). Ensemble
operations are embodied within the GUI-driven framework, where
it is quite efficient to perform multiple design runs and compare results. Of
course, the heart of the package is the command or GUI-driven post-processor
(SSPOST), which is used to analyze and display the results of the simulations and
processing.

REBEL is a recursive Bayesian estimation package in MATLAB available on
the web which performs similar operations. It adds the new statistical-based
unscented algorithms, including the UBP (Chapter 6) and the unscented
transformations [24]. It also includes the new particle filter designs (Chapter 7)
discussed in [25] (see http://choosh.ece.ogi.edu/rebel for more details).
Problems

5.1 Derive the continuous-time BP by starting with the discrete equations of
Table 5.1 and using the following sampled-data approximations:

$A = e^{A_c \Delta t} \approx I + A_c \Delta t$
$B = B_c \Delta t$
$W = W_c \Delta t$
$R_{ww} = R_{w_c w_c} / \Delta t$

5.2 Suppose we are given a continuous-time Gauss-Markov model characterized by

$\dot{x}(t) = A_c(t)x(t) + B_c(t)u(t) + W_c(t)w(t)$

and a discrete (sampled) measurement model such that $t \to t_k$, then

$y(t_k) = C(t_k)x(t_k) + v(t_k)$

where the continuous process $w(t) \sim \mathcal{N}(0, R_{ww})$ and $v(t_k) \sim \mathcal{N}(0, R_{vv})$ with
Gaussian initial conditions.

(a) Determine the state mean ($m_x(t)$) and covariance ($P(t)$).

(b) Determine the measurement mean ($m_y(t_k)$) and covariance ($R_{yy}(t_k)$).

(c) Develop the relationship between the continuous and discrete Gauss-Markov
models based on the solution of the continuous state equations
and approximation using a first order Taylor series for the state transition
matrix, $\Phi(t, t_0)$, and the associated system matrices.

(d) Derive the continuous-discrete BP using first difference approximations
for derivatives and the discrete (sampled) system matrices derived
above.

5.3 The covariance correction equation of the BP algorithm is seldom used directly.
Numerically, the covariance matrix $P(t|t)$ must be positive semidefinite, but in
the correction equation we are subtracting a matrix from a positive semidefinite
matrix and cannot guarantee that the result will remain positive semidefinite
(as it should be) because of roundoff and truncation errors. A solution to
this problem is to replace the standard correction equation with the stabilized
Joseph form, that is,

$P(t|t) = [I - K(t)C(t)]\,P(t|t-1)\,[I - K(t)C(t)]' + K(t)R_{vv}K'(t)$

(a) Derive the Joseph stabilized form.

(b) Demonstrate that it is equivalent to the standard correction equation.

5.4 Prove that a necessary and sufficient condition for a linear BP to be optimal 
is that the corresponding innovations sequence is zero-mean and white. 
5.5 A bird watcher is counting the number of birds migrating to and from a particular
nesting area. Suppose the number of migratory birds, $m(t)$, is modeled
by a first order ARMA model:

$m(t) = -0.5\,m(t-1) + w(t)$ for $w \sim \mathcal{N}(10, 75)$

while the number of resident birds is static, that is,

$r(t) = r(t-1)$

The number of resident birds is averaged, leading to the expression

$y(t) = 0.5\,r(t) + m(t) + v(t)$ for $v \sim \mathcal{N}(0, 0.1)$

(a) Develop the two state Gauss-Markov model with initial values $r(0) = 20$
birds, $m(0) = 100$ birds, cov $r(0) = 25$.

(b) Use the MBP algorithm to estimate the number of resident and migrating
birds in the nesting area for the data set $y(t) = \{70, 80\}$, that is, what is
$\hat{x}(2|2)$?

5.6 Suppose we are given a measurement device that not only acquires the current
state but also the state delayed by one time step (multipath) such that

$y(t) = Cx(t) + Ex(t-1) + v(t)$

Derive the recursive form for this associated MBP. (Hint: Recall from
the properties of the state transition matrix that $x(t) = \Phi(t, \tau)x(\tau)$ and
$\Phi^{-1}(t, \tau) = \Phi(\tau, t)$.)

(a) Using the state transition matrix for discrete systems, find the relationship
between $x(t)$ and $x(t-1)$ in the Gauss-Markov model.

(b) Substitute this result into the measurement equation to obtain the usual
form

$y(t) = \tilde{C}x(t) + \tilde{v}(t)$

What are the relations for $\tilde{C}$ and $\tilde{v}(t)$ in the new system?

(c) Derive the new statistics for $\tilde{v}(t) \sim \mathcal{N}(\mu_{\tilde{v}}, R_{\tilde{v}\tilde{v}})$.

(d) Are w and v correlated? If so, use the prediction form to develop the MBP
algorithm for this system.

5.7 Develop the discrete linearized (perturbation) models for each of the following
nonlinear systems ([7]):

• Synchronous (unsteady) motor: $\ddot{x}(t) + C\dot{x}(t) + p \sin x(t) = L(t)$

• Duffing equation: $\ddot{x}(t) + \alpha x(t) + \beta x^3(t) = F(\cos \omega t)$

• Van der Pol equation: $\ddot{x}(t) + \epsilon\dot{x}(t)[1 - x^2(t)] + x(t) = m(t)$

• Hill equation: $\ddot{x}(t) - \alpha x(t) + \beta p(t)x(t) = m(t)$

(a) Develop the LZ-BP.

(b) Develop the XBP.

(c) Develop the IX-BP.

5.8 Suppose we are given the following discrete system

$x(t) = -\omega^2 x(t-1) + \sin x(t-1) + \alpha u(t-1) + w(t-1)$

$y(t) = x(t) + v(t)$

with w and v zero-mean, white Gaussian with usual covariances, $R_{ww}$ and $R_{vv}$.

(a) Develop the LZ-BP for this process.

(b) Develop the XBP for this process.

(c) Develop the IX-BP for this process.

(d) Suppose the parameters $\omega$ and $\alpha$ are unknown. Develop the XBP such that
the parameters are jointly estimated along with the states. (Hint: Augment
the states and parameters to create a new state vector.)

5.9 Assume that we have the following nonlinear continuous-discrete Gauss-Markov
model:

$\dot{x}(t) = f[x(t)] + g[u(t)] + w(t)$
$z(t_k) = h[x(t_k)] + v(t_k)$

with w and v zero-mean, white Gaussian with usual covariances, Q and R.

(a) Develop the perturbation model for $\delta x(t) := x(t) - x^*(t)$ for $x^*(t)$ the given
reference trajectory.

(b) Develop the LZ-BP for this process.

(c) Choose $x^*(t) = \hat{x}(t)$ whenever $\hat{x}(t)$ is available during the recursion.
Therefore, develop the continuous-discrete XBP.

5.10 Suppose we assume that a target is able to maneuver, that is, we assume that
the target velocity satisfies a first order AR model given by:

$v_\tau(t) = -\alpha v_\tau(t-1) + w_\tau(t-1)$ for $w \sim \mathcal{N}(0, R_{w_\tau w_\tau})$

(a) Develop the Cartesian tracking model for this process.

(b) Develop the corresponding XBP assuming all parameters are known
a priori.

(c) Develop the corresponding XBP assuming $\alpha$ is unknown.

5.11 Nonlinear processors (LZ-BP, XBP, IX-BP) can be used to develop neural networks
used in many applications. Suppose we model a generic neural network
behavior by

$x(t) = x(t-1) + w(t-1)$
$y(t) = c[x(t), u(t), a(t)] + v(t)$

where $x(t)$ is the network weights (parameters), $u(t)$ is the input or training
sequence, $a(t)$ is the node activators with w and v zero-mean, white Gaussian
with covariances, $R_{ww}$ and $R_{vv}$.

(a) Develop the LZ-BP for this process. 

(b) Develop the XBP for this process. 

(c) Develop the IX-BP for this process. 

5.12 The Mackey-Glass time delay differential equation is given by

$\dot{x}(t) = \frac{\alpha x(t-\tau)}{1 + x(t-\tau)^N} - \beta x(t) + w(t)$

$y(t) = x(t) + v(t)$

where $\alpha$, $\beta$ are constants, N is a positive integer with w and v zero-mean,
white Gaussian with covariances, $R_{ww}$ and $R_{vv}$. For the parameter set: $\alpha = 0.2$,
$\beta = 0.1$, $\tau = 7$ and $N = 10$ with $x(0) = 1.2$

(a) Develop the LZ-BP for this process. 

(b) Develop the XBP for this process. 

(c) Develop the IX-BP for this process. 

5.13 Consider the problem of estimating a random signal from an AM modulator
characterized by

$s(t) = \sqrt{2P}\,a(t) \sin \omega_c t$
$r(t) = s(t) + v(t)$

where $a(t)$ is assumed to be a Gaussian random signal with power spectrum

$S_{aa}(\omega) = \frac{2 k_a P_a}{\omega^2 + k_a^2}$

also assume that the processes are contaminated with the usual additive noise
sources: w and v zero-mean, white Gaussian with covariances, $R_{ww}$ and $R_{vv}$.

(a) Develop the continuous-time Gauss-Markov model for this process.

(b) Develop the corresponding discrete-time Gauss-Markov model for this
process using first differences.

(c) Develop the BP.

(d) Assume the carrier frequency, $\omega_c$, is unknown. Develop the XBP for this
process.
6.1 INTRODUCTION

In this chapter we discuss an extension of the approximate nonlinear Bayesian suite
of processors that takes a distinctly different approach to the nonlinear Gaussian
problem. Instead of attempting to improve on the linearized approximation in the
nonlinear XBP (EKF) schemes discussed in the previous chapter or increasing the
order of the Taylor series approximations [1-11], a modern statistical (linearization)
transformation approach is developed. It is founded on the idea that “it is easier to
approximate a probability distribution, than to approximate an arbitrary nonlinear
function or transformation” [3, 12-21]. The classical nonlinear Bayesian processors
discussed so far are based on linearizing nonlinear functions of the state and
measurements to provide estimates of the underlying statistics (using Jacobians), while
the statistical transformation approach is based on selecting a set of sample points
that capture certain properties of the underlying distribution. This transformation is
essentially a “statistical linearization” technique that incorporates the uncertainty of
the prior random variable when linearizing [12]. This set of sample points is then
nonlinearly transformed or propagated to a new space. The statistics of the new samples
are then calculated to provide the required estimates. Note that this method differs
from the sampling-resampling approach, in which random samples are drawn from
the prior distribution and updated through the likelihood function to produce a sample
from the posterior distribution [22]. Here the samples are not drawn at random, but
according to a specific deterministic algorithm. Once this transformation is performed,
the resulting processor, the sigma-point Bayesian processor (SPBP) or equivalently
the unscented Kalman filter (UKF), evolves. It is a recursive processor that resolves
some of the approximation issues [14] and deficiencies of the XBP of the previous
sections. We first develop the idea of nonlinearly transforming a random vector with
known probability distribution and then apply it to a Gaussian problem leading to the
SPBP algorithm. We then apply the modern processor to the previous nonlinear state
estimation problem and compare its performance to the classical.


6.2 SIGMA-POINT (UNSCENTED) TRANSFORMATIONS 

A completely different approach to nonlinear estimation evolves from the concept of
statistical linearization [3, 16, 19, 24]. Instead of approximating the nonlinear process
and measurement dynamics of the underlying system using a truncated Taylor series
representation that leads to the classical forms of estimation (LZKF, EKF, IEKF, etc.),
the statistical approximation or equivalently statistical linearization method provides
an alternative that takes into account the uncertainty or probabilistic spread of the
prior random vector. The basic idea is to approximate (linearize) a nonlinear function
of a random vector while preserving its first and second moments; therefore, this
approach requires a priori knowledge of its distribution, resulting in a more statistically
accurate transformation.

6.2.1 Statistical Linearization 

Following Gelb [3], statistical linearization evolves from the idea of propagating an
$N_x$-dimensional random vector x with PDF, $p_X(x)$, through an arbitrary nonlinear
transformation $a[\cdot]$ to generate a new random vector,

$y = a[x]$ (6.1)

Expanding this function in a Taylor series about $x_i$ gives

$a[x] = a[x_i] + (x - x_i)'\,\nabla_x a[x_i] + \text{H.O.T.}$ (6.2)

where $\nabla_x$ is the $N_x$-gradient vector defined by

$\nabla_x a[x_i] := \left.\frac{\partial a[x]}{\partial x}\right|_{x = x_i}$ (6.3)

Neglecting the H.O.T. and simplifying with the appropriate definitions, we obtain
the “regression form” of y regressing on x

$y = a[x] \approx Ax + b$ (6.4)

where both A and b are to be determined given $x \sim p_X(x)$. Estimation of A, b follows
from the traditional linear algebraic perspective by defining the approximation or
linearization error as:

$\epsilon := y - Ax - b = a[x] - Ax - b$ (6.5)
along with the corresponding cost function

$J_\epsilon := E\{\epsilon'\epsilon\}$ (6.6)

that is minimized with respect to the unknowns A, b. Performing the usual
differentiation of the cost function, $J_\epsilon \to J_\epsilon(A, b)$, with respect to the unknowns, setting
the results to zero and solving generates the MMSE result discussed previously in
Chapter 2. That is, we first minimize with respect to b

$\nabla_b J_\epsilon(A, b) = \nabla_b E\{\epsilon'\epsilon\} = E\{\nabla_b (y - Ax - b)'\epsilon\} = 0$ (6.7)

Now using the chain rule of Eq. 2.11 with $a' = (y - Ax - b)'$ and $b = \epsilon$ we obtain

$\nabla_b J_\epsilon(A, b) = E\{\nabla_b (y - Ax - b)'\epsilon\} = -2E\{(y - Ax - b)\} = 0$

Solving for b, we obtain

$b = \mu_y - A\mu_x$ (6.8)

Furthermore, substituting for b into the cost and differentiating with respect to A,
setting the result to zero and solving yields

$\nabla_A J_\epsilon(A, b)\big|_{b = \mu_y - A\mu_x} = E\{A(x - \mu_x)(x - \mu_x)' - (y - \mu_y)(x - \mu_x)'\} = 0$

giving

$A\,E\{(x - \mu_x)(x - \mu_x)'\} = E\{(y - \mu_y)(x - \mu_x)'\}$

or

$A R_{xx} = R_{yx}$

that has the MMSE solution

$A = R_{yx}R_{xx}^{-1} = R'_{xy}R_{xx}^{-1}$ (6.9)
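As a quick check of Eqs. 6.8 and 6.9, the following worked case applies them to the scalar transformation $a[x] = x^2$ with $x \sim \mathcal{N}(\mu_x, \sigma_x^2)$ and compares the result with a first-order Taylor expansion about the mean. The choice of $a[x]$ is an illustrative assumption, not an example taken from the text at this point.

```latex
\begin{align*}
\mu_y  &= E\{x^2\} = \sigma_x^2 + \mu_x^2, \qquad R_{xx} = \sigma_x^2, \\
R_{yx} &= E\{x^3\} - \mu_y\mu_x
        = (\mu_x^3 + 3\mu_x\sigma_x^2) - (\sigma_x^2 + \mu_x^2)\mu_x
        = 2\mu_x\sigma_x^2, \\
A &= R_{yx}R_{xx}^{-1} = 2\mu_x, \qquad
b = \mu_y - A\mu_x = \sigma_x^2 - \mu_x^2, \\
\text{statistical linearization:}\quad & y \approx 2\mu_x x + (\sigma_x^2 - \mu_x^2), \\
\text{Taylor series about } \mu_x:\quad & y \approx 2\mu_x x - \mu_x^2 .
\end{align*}
```

The slopes agree, but the statistically linearized offset carries the additional term $\sigma_x^2$, so its mean equals the true $\mu_y$, whereas the Taylor form is biased by $\sigma_x^2$.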

Now suppose we linearize the function through this relation by constructing a linear
regression between a selected set of $N_x$ points and the nonlinear transformation, $a[x]$,
at these selected points. That is, we define the set of selected points as $\{\mathcal{X}_i, \mathcal{Y}_i\}$ with

$\mathcal{Y}_i = a[\mathcal{X}_i]$

Following the same approach as before, we define the “pointwise” linearization
error as

$\epsilon_i = \mathcal{Y}_i - A\mathcal{X}_i - b = a[\mathcal{X}_i] - A\mathcal{X}_i - b$ (6.10)

with

$\mu_\epsilon = 0$, and $R_{\epsilon\epsilon} = R_{yy} - AR_{xx}A'$ (6.11)
Performing a weighted minimization (as above) on the pointwise linearization error
with a set of regression weights, $\{W_i\};\ i = 1, \ldots, N_x$, yields the following solution:

$A = R'_{xy}R_{xx}^{-1}, \qquad b = \mu_y - A\mu_x$ (6.12)

where the weighted statistics are given by

$\mu_x = \sum_{i=1}^{N_x} W_i \mathcal{X}_i$

$R_{xx} = \sum_{i=1}^{N_x} W_i (\mathcal{X}_i - \mu_x)(\mathcal{X}_i - \mu_x)'$ (6.13)

and for the posterior

$\mu_y = \sum_{i=1}^{N_x} W_i \mathcal{Y}_i$

$R_{yy} = \sum_{i=1}^{N_x} W_i (\mathcal{Y}_i - \mu_y)(\mathcal{Y}_i - \mu_y)'$ (6.14)

with the corresponding cross-covariance

$R_{xy} = \sum_{i=1}^{N_x} W_i (\mathcal{X}_i - \mu_x)(\mathcal{Y}_i - \mu_y)'$ (6.15)

With this underlying regression and pointwise transformation in place, the next step
is to determine the corresponding set of regression (sigma) points $\{\mathcal{X}_i\}$ and their
corresponding weights, $\{W_i\}$.
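The weighted regression of Eqs. 6.11-6.15 can be sketched numerically for the scalar case. The helper name `wslr_scalar`, the three-point regression set and the test function $a[x] = x^2$ are all assumptions made for illustration; the code uses only the Python standard library.

```python
import math

def wslr_scalar(points, weights, a):
    """Weighted statistical linear regression of y = a(x) on x (scalar case).

    Returns the slope A, the offset b and the error variance R_ee.
    """
    ys = [a(x) for x in points]                       # Y_i = a[X_i]
    mu_x = sum(w * x for w, x in zip(weights, points))
    mu_y = sum(w * y for w, y in zip(weights, ys))
    r_xx = sum(w * (x - mu_x) ** 2 for w, x in zip(weights, points))
    r_yy = sum(w * (y - mu_y) ** 2 for w, y in zip(weights, ys))
    r_xy = sum(w * (x - mu_x) * (y - mu_y)
               for w, x, y in zip(weights, points, ys))
    A = r_xy / r_xx                                   # scalar form of Eq. 6.12
    b = mu_y - A * mu_x
    r_ee = r_yy - A * r_xx * A                        # scalar form of Eq. 6.11
    return A, b, r_ee

# Hypothetical regression set: mean 1, variance 0.25, three symmetric points.
mu, sigma = 1.0, 0.5
spread = math.sqrt(3.0) * sigma
points = [mu, mu - spread, mu + spread]
weights = [2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0]

A, b, r_ee = wslr_scalar(points, weights, lambda x: x * x)
print(A, b, r_ee)
```

For this set the slope comes out close to $2\mu_x$ and the residual variance $R_{\epsilon\epsilon}$ is nonzero, which is the part of the output spread that the linear fit cannot explain.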

6.2.2 Sigma-Point Approach 

The sigma-point transformation (SPT), or equivalently the unscented transformation
(UT), is a technique for calculating the statistics of a random vector that has been
nonlinearly transformed. The approach is illustrated in Fig. 6.1. Here the set of samples, the
so-called sigma points, are chosen so that they capture specific properties of the
underlying distribution. In the figure we consider $p_X(x)$ to be a 2D Gaussian; the
σ-points are then located along the major and minor axes of the covariance ellipse,
capturing the essence of this distribution. In general, the goal is to construct a set
of σ-points possessing the same statistics as the original distribution such that, when
nonlinearly transformed to the new space, the new set of points sufficiently captures
the posterior statistics. The transformation occurs on a point-by-point basis, since
it is simpler to match statistics of individual points rather than the entire PDF. The
statistics of the transformed points are then calculated to provide the desired estimates
of the transformed distribution.

FIGURE 6.1 Unscented transformation. A set of distribution points shown on an error
ellipsoid are selected and transformed into a new space where their underlying statistics
are estimated.

Following the development of Julier [15], consider propagating an $N_x$-dimensional
random vector, x, through an arbitrary nonlinear transformation $a[\cdot]$ to generate a new
random vector,

$y = a[x]$ (6.16)

The set of σ-points, $\{\mathcal{X}_i\}$, consists of $N_\sigma + 1$ vectors with appropriate weights,
$\{W_i\}$, given by $\Sigma = \{\mathcal{X}_i, W_i;\ i = 0, \ldots, N_\sigma\}$. The weights can be positive or negative
but must satisfy the normalization constraint

$\sum_{i=0}^{N_\sigma} W_i = 1$

so that the estimate of the statistics remains unbiased. The problem then becomes:

GIVEN the sigma-points, $\Sigma = \{\mathcal{X}_i, W_i;\ i = 0, \ldots, N_\sigma\}$, and the nonlinear
transformation $a[x]$, FIND the statistics of the transformed samples,

$\mu_y = E\{y\}$ and $R_{yy} = \mathrm{Cov}(y)$

The sigma-point (unscented) transformation (SPT) approach to approximate the
statistics, $(\mu_y, R_{yy})$, is:

1. Determine the number, weights and locations of the σ-point set, $\Sigma$, based on
the unique characteristics of the prior distribution

2. Nonlinearly transform each point to obtain the set of new (posterior)
σ-points, $\{\mathcal{Y}_i\}$:

$\mathcal{Y}_i = a[\mathcal{X}_i]$ (6.17)

3. Estimate the posterior mean by its weighted average¹

$\mu_y = \sum_{i=0}^{N_\sigma} W_i \mathcal{Y}_i$ (6.18)

4. Estimate the posterior covariance by its weighted outer product

$R_{yy} = \sum_{i=0}^{N_\sigma} W_i (\mathcal{Y}_i - \mu_y)(\mathcal{Y}_i - \mu_y)'$ (6.19)
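Steps 2 through 4 above can be sketched as a small routine that accepts any σ-point set and weights. The function name `unscented_stats`, the five-point set for a two-dimensional standard Gaussian and the affine test map are illustrative assumptions; an affine map is used because its exact output mean and covariance are known in closed form.

```python
import math

def unscented_stats(points, weights, a):
    """Transform each point, then form the weighted mean and covariance."""
    ys = [a(x) for x in points]                                  # step 2
    n = len(ys[0])
    mu_y = [sum(w * y[k] for w, y in zip(weights, ys))           # step 3
            for k in range(n)]
    r_yy = [[sum(w * (y[j] - mu_y[j]) * (y[k] - mu_y[k])         # step 4
                 for w, y in zip(weights, ys))
             for k in range(n)]
            for j in range(n)]
    return mu_y, r_yy

# Hypothetical point set for s ~ N(0, I) in two dimensions.
r = math.sqrt(3.0)
points = [[0.0, 0.0], [r, 0.0], [0.0, r], [-r, 0.0], [0.0, -r]]
weights = [1.0 / 3.0] + [1.0 / 6.0] * 4

def affine(x):
    """Test map y = M x + c with M = [[2, 0], [1, -1]] and c = [1, 0]."""
    return [2.0 * x[0] + 1.0, x[0] - x[1]]

mu_y, r_yy = unscented_stats(points, weights, affine)
print(mu_y, r_yy)
```

For this affine map the routine should return approximately the mean $c$ and the covariance $MM'$, which gives a simple sanity check before trying a genuinely nonlinear $a[\cdot]$.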

One set of σ-points that satisfies the above conditions consists of a symmetric set
of $N_\sigma = 2N_x + 1$ points that lie on the $\sqrt{N_x}$-th covariance contour [13]:

$\mathcal{X}_0 = \mu_x$
$\mathcal{X}_i = \mu_x + \sqrt{N_x}\,\sigma_i$
$\mathcal{X}_{i+N_x} = \mu_x - \sqrt{N_x}\,\sigma_i$

where $\sqrt{N_x}\,\sigma_i$ is the i-th standard deviation scaled by $\sqrt{N_x}$ and $W_i$ is the weight
associated with the i-th σ-point.

Thus, the σ-point transformation can be considered a statistical linearization
method that provides an optimal (MMSE) linear approximation to a general nonlinear
transformation taking into account the prior second-order statistics of the underlying
random variable, that is, its mean and covariance [19]. It can be accomplished using
the weighted statistical linear regression approach (WSLR) [16, 19], resulting in the
weight-constrained estimates above.

Therefore, in contrast to random sampling, selection of the “deterministic”
σ-points requires resolving the following critical issues:

• $N_\sigma$, the number of σ-points;

• $W_i$, the weights assigned to each σ-point; and

• $\mathcal{X}_i$, the location of the σ-points.

That is, we must answer the questions: How many (points)? What (weights)? and
Where (located)?


1 Note that this estimate is actually a weighted statistical linear regression (WSLR) of the random variable 
[19]. 
The σ-points should be selected or constrained to capture the “most important”
statistical properties of the random vector, x. Let the underlying prior $p_X(x)$ be its
density function, then the σ-points capture the properties by satisfying the necessary
(constraint) condition

$g[\Sigma, p_X(x)] = 0$ (6.20)

Since it is possible to meet this constraint condition and still have some degree of
freedom in the choice of σ-points, assigning a penalty function

$p[\Sigma, p_X(x)]$ (6.21)

resolves any ambiguity in the choice. This function is to incorporate desirable features
that do not necessarily have to be satisfied. Decreasing the penalty function leads to
more and more desirable solutions. In general, the σ-point set relative to this problem
that is the most desirable is the set that conforms to the necessary conditions of
Eqs. 6.20 and 6.21. The σ-points must satisfy

$\min_{\Sigma}\, p[\Sigma, p_X(x)] \ni g[\Sigma, p_X(x)] = 0$ (6.22)

The decision as to which properties of the random vector x to capture precisely or
approximate depends on the particular application. For our problems, we wish to
match the moments of the underlying distribution of σ-points with those of x.

Summarizing, to apply the SPT approach: (1) a set of σ-points, $\Sigma$, is constructed
that is “deterministically” constrained to possess the identical statistics of the prior;
(2) each σ-point is nonlinearly transformed to the new space, $\mathcal{Y}_i$; and (3) the statistics
of the transformed set are approximated using WSLR techniques. The following
example illustrates this approach.


Example 6.1 

Consider a scalar random variable, x, for which we would like to propagate its first two
moments $(\mu_x, R_{xx})$ through a nonlinear transformation, $\mathcal{Y}_i = a[\mathcal{X}_i]$. The corresponding
set of constraint equations are defined by:

$g(\Sigma, p_X(x)) = \begin{pmatrix} \sum_{i=0}^{N_\sigma} W_i - 1 \\ \sum_{i=0}^{N_\sigma} W_i \mathcal{X}_i - \mu_x \\ \sum_{i=0}^{N_\sigma} W_i (\mathcal{X}_i - \mu_x)(\mathcal{X}_i - \mu_x)' - R_{xx} \end{pmatrix} = 0$

with posterior

$\mu_y = \sum_{i=0}^{N_\sigma} W_i \mathcal{Y}_i$

$R_{yy} = \sum_{i=0}^{N_\sigma} W_i (\mathcal{Y}_i - \mu_y)(\mathcal{Y}_i - \mu_y)'$

Suppose $x \sim \mathcal{N}(0, 1)$, then the set of σ-points are defined by $N_\sigma = 2$ or 3 points
with $\mathcal{X}_0 = \sigma_0 = 0$, $\mathcal{X}_1 = -\sigma_1$, and $\mathcal{X}_2 = \sigma_2$. The constraint equations become

$W_0 + W_1 + W_2 - 1 = 0$
$-W_1\sigma_1 + W_2\sigma_2 - 0 = 0$
$W_1\sigma_1^2 + W_2\sigma_2^2 - 1 = 0$

for $W_0$, a free parameter whose value affects the 4th and higher moments. These
relations are not uniquely satisfied (4 unknowns, 3 equations); therefore, we must add
another constraint to the set based on the property that the skew is zero for the Gaussian

$-W_1\sigma_1^3 + W_2\sigma_2^3 = 0$

Solving these equations and using the symmetry property of the Gaussian gives:

$\sigma_1 = 1/\sqrt{2W_1}, \quad W_1 = (1 - W_0)/2, \quad \sigma_2 = \sigma_1, \quad W_2 = W_1$ △△△
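The solution of Example 6.1 can be checked numerically. The sketch below builds the three-point set for a chosen $W_0$ and evaluates its raw moments; the helper names `scalar_sigma_set` and `moment` and the trial values of $W_0$ are assumptions for illustration (any $W_0 < 1$ is admissible here).

```python
import math

def scalar_sigma_set(w0):
    """Sigma points and weights for x ~ N(0, 1), with W0 as the free parameter."""
    w1 = (1.0 - w0) / 2.0             # W1 = W2 = (1 - W0)/2
    s1 = 1.0 / math.sqrt(2.0 * w1)    # sigma_1 = sigma_2 = 1/sqrt(2 W1)
    return [0.0, -s1, s1], [w0, w1, w1]

def moment(points, weights, order):
    """Weighted raw moment of the discrete point set."""
    return sum(w * x ** order for w, x in zip(weights, points))

for w0 in (0.0, 0.5, 2.0 / 3.0):
    pts, wts = scalar_sigma_set(w0)
    # Orders 0..3 are fixed by the constraints; order 4 varies with W0.
    print(w0, [round(moment(pts, wts, k), 6) for k in range(5)])
```

The moments of order 0 through 3 are the same for every $W_0$, while the fourth moment equals $1/(1 - W_0)$, so only $W_0 = 2/3$ also reproduces the Gaussian value of 3.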


This example illustrates the use of the constraint equations to determine the
σ-points from properties of the prior. Next, let us incorporate the nonlinear
transformation and penalty function to demonstrate the entire selection procedure. Following
Julier [13] consider the following one-dimensional example.

Example 6.2 

As before, suppose we have a scalar random variable x, Gaussian distributed with
mean $\mu_x$ and variance $\sigma_x^2$, and we would like to know the statistics (mean and variance)
of y which nonlinearly transforms x according to

$y = a[x] = x^2$

Here the true mean and variance are

$\mu_y = E\{y\} = E\{x^2\} = \sigma_x^2 + \mu_x^2$

and

$\sigma_y^2 = E\{y^2\} - \mu_y^2 = E\{x^4\} - \mu_y^2 = (3\sigma_x^4 + 6\sigma_x^2\mu_x^2 + \mu_x^4) - (\sigma_x^4 + 2\sigma_x^2\mu_x^2 + \mu_x^4) = 2\sigma_x^4 + 4\sigma_x^2\mu_x^2$

According to the SPT of Eqs. 6.17-6.19 the number of σ-points is $N_\sigma = 3$. Since this is
a scalar problem ($N_x = 1$) only 3 points are required: the two σ-points and the mean,
therefore, we have

$\{\mathcal{X}_0, \mathcal{X}_1, \mathcal{X}_2\} = \{\mu_x,\ \mu_x - \sqrt{1+\kappa}\,\sigma_x,\ \mu_x + \sqrt{1+\kappa}\,\sigma_x\}$

$\{W_0, W_1, W_2\} = \{\kappa/(1+\kappa),\ 1/2(1+\kappa),\ 1/2(1+\kappa)\}$

and $\kappa$ is chosen as a scaling factor to be determined. Propagating these samples
through $a[\cdot]$ gives the transformed samples, say $\mathcal{X}'_i$, that lie at

$\{\mathcal{X}'_0, \mathcal{X}'_1, \mathcal{X}'_2\} = \{\mu_x^2,\ (\mu_x - \sqrt{1+\kappa}\,\sigma_x)^2,\ (\mu_x + \sqrt{1+\kappa}\,\sigma_x)^2\}$

The mean of y is given by

$\mu_y = \frac{1}{2(1+\kappa)}\left(2\kappa\mu_x^2 + 2\mu_x^2 + 2(1+\kappa)\sigma_x^2\right) = \mu_x^2 + \sigma_x^2$

which is precisely the true mean. Next the covariance is given by

$\sigma_y^2 = \frac{1}{2(1+\kappa)}\left(\sum_{i=1}^{2}(\mathcal{X}'_i - \mu_y)^2 + 2\kappa\sigma_x^4\right) = \kappa\sigma_x^4 + 4\mu_x^2\sigma_x^2$

To find the solution, $\kappa$ must be specified. The kurtosis (fourth central moment) of the
true distribution is $3\sigma_x^4$ and that of the σ-points is $\sigma_x^4$ scaled by an amount
$(1 + \kappa)$. The kurtosis of both distributions therefore only agrees when $\kappa = 2$, which gives the exact
solution. This completes the Gaussian example. △△△
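The result of Example 6.2 is easy to verify numerically. The sketch below propagates the three σ-points through $y = x^2$ for a given $\kappa$ and compares the outcome with the closed-form Gaussian moments; the function names and the trial values of $\mu_x$, $\sigma_x$ are assumptions for illustration.

```python
import math

def spt_square(mu_x, sigma_x, kappa):
    """Propagate three scalar sigma points through y = x**2."""
    spread = math.sqrt(1.0 + kappa) * sigma_x
    points = [mu_x, mu_x - spread, mu_x + spread]
    weights = [kappa / (1.0 + kappa),
               1.0 / (2.0 * (1.0 + kappa)),
               1.0 / (2.0 * (1.0 + kappa))]
    ys = [x * x for x in points]
    mu_y = sum(w * y for w, y in zip(weights, ys))
    var_y = sum(w * (y - mu_y) ** 2 for w, y in zip(weights, ys))
    return mu_y, var_y

def true_square_stats(mu_x, sigma_x):
    """Exact mean and variance of y = x**2 for Gaussian x."""
    mu_y = sigma_x ** 2 + mu_x ** 2
    var_y = 2.0 * sigma_x ** 4 + 4.0 * sigma_x ** 2 * mu_x ** 2
    return mu_y, var_y

mu_x, sigma_x = 1.5, 0.7            # hypothetical prior
print(true_square_stats(mu_x, sigma_x))
for kappa in (0.5, 2.0):
    print(kappa, spt_square(mu_x, sigma_x, kappa))
```

The mean matches for every $\kappa$, while the variance matches only at $\kappa = 2$; for other values it equals $\kappa\sigma_x^4 + 4\mu_x^2\sigma_x^2$.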

Next using the underlying principles of the SPT, we develop the multivariate 
Gaussian case in more detail. 

6.2.3 SPT for Gaussian Prior Distributions 

To be more precise and parallel the general philosophy, we choose “to approximate
the underlying Gaussian distribution rather than approximate its underlying nonlinear
transformation,” in contrast to the XBP (EKF). This parameterization captures
the prior mean and covariance and permits the direct propagation of this information
through the arbitrary set of nonlinear functions. Here we accomplish this (approximately)
by generating the discrete distribution having the same first and second (and
potentially higher) moments where each point is directly transformed. The mean and
covariance of the transformed ensemble can then be computed as the estimate of
the nonlinear transformation of the original distribution. As illustrated in Fig. 6.2,
we see samples of the true prior distribution of x and the corresponding nonlinearly
transformed distribution of y. Using the same transformation, the selected σ-points
are transformed as well, closely preserving the dominant moments of the original
distribution (see Julier [13] for more details).
FIGURE 6.2 Sigma point (unscented) transformation approximation of the original
distribution moments after nonlinear transformation.

It is important to recognize that the SPT has specific properties when the underlying
distribution is Gaussian [16]. The Gaussian has two properties which play a significant
role in the form of the σ-points selected. First, since the distribution is symmetric, the
σ-points can be selected with this symmetry. Second, the problem of approximating x
with an arbitrary mean and covariance can be reduced to that of a standard zero-mean,
unit variance Gaussian, since

$x = \mu_x + Us$ for U the matrix square root of $R_{xx}$ (6.23)

where $s \sim \mathcal{N}(0, I)$. Therefore, in the Gaussian case, the second order SPT uses a set
of σ-points which capture the first two moments of s correctly, that is, they must
capture the mean, covariance and symmetry. Let $s_i$ be the i-th component of s, then its
covariance is given by

$E\{s_i^2\} = 1 \quad \forall i$ (6.24)

Also from the symmetry properties of the distribution, all odd-ordered moments
are zero.

The minimum number of points whose distribution obeys these conditions has two
types of σ-points: (1) a single point at the origin of the s-axis with weight, $W_0$; and
(2) $2N_x$ symmetrically distributed points on the coordinate s-axes a distance r from
the origin, all having the same weight, $W_1$. Thus, there are $N_\sigma = 2N_x + 1$ σ-points
(five for a two-dimensional distribution). The values of $W_0$, $W_1$ and r are selected to ensure that
their covariance is the identity. Therefore, due to their symmetry, it is only necessary
to specify one direction on the s-axis, say $s_1$. The constraint function will consist of
the moment for $E\{s_1^2\}$ and the normalization condition that must be satisfied. Therefore,
we have that

$g[\sigma, p_S(s)] = \begin{pmatrix} 2W_1r^2 - 1 \\ W_0 + 2N_xW_1 - 1 \end{pmatrix} = 0$ (6.25)

The solution to these equations is given by

$r = \frac{1}{\sqrt{2W_1}}$ and $W_0 = 1 - 2N_xW_1$

By reparameterizing $W_1 := \frac{1}{2(N_x + \kappa)}$, it can be shown that after pre-multiplying
by U, the matrix square root of $R_{xx}$, the σ-points for x are:

$\mathcal{X}_0 = \mu_x, \qquad W_0 = \frac{\kappa}{N_x + \kappa}$

$\mathcal{X}_i = \mu_x + \left(\sqrt{(N_x + \kappa)R_{xx}}\right)_i, \qquad W_i = \frac{1}{2(N_x + \kappa)}$

$\mathcal{X}_{i+N_x} = \mu_x - \left(\sqrt{(N_x + \kappa)R_{xx}}\right)_i, \qquad W_{i+N_x} = \frac{1}{2(N_x + \kappa)}$

where $\kappa$ is a scalar, $\left(\sqrt{(N_x + \kappa)R_{xx}}\right)_i$ is the i-th row or column of the matrix square root
of $(N_x + \kappa)R_{xx}$ and $W_i$ is the weight associated with the i-th σ-point. The parameter $\kappa$ is
free; however, it can be selected to minimize the mismatch between the fourth-order
moments of the σ-points and the true distribution [16]. From the properties of the
Gaussian, we have that

$E\{s_i^4\} = 3 \quad \forall i$ (6.27)

The penalty function penalizes the discrepancy between the σ-points and the true
value along one direction ($s_1$), in this case, due to the symmetry. Therefore, we
have that

$p[\sigma, p_S(s)] = |2W_1r^4 - 3|$ giving $W_1 = \frac{1}{6}$ or $\kappa = 3 - N_x$ (6.28)

It is clear that the ability to minimize p depends on the number of degrees of freedom
that are available after the constraint g is satisfied for the given set of σ-points; the
kurtosis cannot be matched exactly without developing a larger set of σ-points (see
Julier [16] for details).

We summarize the sigma-point processor under the multivariate Gaussian assumptions:
Given an $N_x$-dimensional Gaussian distribution having covariance $R_{xx}$ we can
generate a set of $O(N_\sigma)$ σ-points having the same sample covariance from the columns
(or rows) of the matrices $\pm\sqrt{(N_x + \kappa)R_{xx}}$. Here $\kappa$ is the scaling factor discussed
previously. This set is zero mean, but if the original distribution has mean $\mu_x$, then simply
adding $\mu_x$ to each of the σ-points yields a symmetric set of $N_\sigma = 2N_x + 1$ samples
having the desired mean and covariance. Since the set is symmetric, its odd central
moments are null; so its first three moments are identical to those of the original
Gaussian distribution. This is the minimal number of σ-points capable of capturing the
essential statistical information. The basic SPT technique for a multivariate Gaussian
distribution [16] is therefore:

1. Determine the set of $N_\sigma = 2N_x + 1$ σ-points from the rows or columns of
$\pm\sqrt{(N_x + \kappa)R_{xx}}$. For the nonzero mean case compute $\mathcal{X}_i = \sigma + \mu_x$;

$\mathcal{X}_0 = \mu_x, \qquad W_0 = \frac{\kappa}{N_x + \kappa}$

$\mathcal{X}_i = \mu_x + \left(\sqrt{(N_x + \kappa)R_{xx}}\right)_i, \qquad W_i = \frac{1}{2(N_x + \kappa)}$

$\mathcal{X}_{i+N_x} = \mu_x - \left(\sqrt{(N_x + \kappa)R_{xx}}\right)_i, \qquad W_{i+N_x} = \frac{1}{2(N_x + \kappa)}$

where $\kappa$ is a scalar, $\left(\sqrt{(N_x + \kappa)R_{xx}}\right)_i$ is the i-th row or column of the matrix square
root of $(N_x + \kappa)R_{xx}$ and $W_i$ is the weight associated with the i-th σ-point;

2. Nonlinearly transform each point to obtain the set of new σ-points: $\mathcal{Y}_i = a[\mathcal{X}_i]$

3. Estimate the posterior mean of the new samples by its weighted average
(regression)

$\mu_y = \sum_{i=0}^{2N_x} W_i \mathcal{Y}_i$

4. Estimate the posterior covariance of the new samples by its weighted outer
product (regression)

$R_{yy} = \sum_{i=0}^{2N_x} W_i (\mathcal{Y}_i - \mu_y)(\mathcal{Y}_i - \mu_y)'$
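Step 1 of the Gaussian SPT can be sketched in a few lines of standard-library Python. The version below uses a Cholesky factor as the matrix square root, which is one admissible choice among several; the helper names (`cholesky`, `gaussian_sigma_points`, `weighted_moments`) and the numerical prior are assumptions for illustration.

```python
import math

def cholesky(M):
    """Lower-triangular L with L L' = M (M symmetric positive definite)."""
    n = len(M)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(M[i][i] - s)
            else:
                L[i][j] = (M[i][j] - s) / L[j][j]
    return L

def gaussian_sigma_points(mu, R, kappa):
    """Return the 2*N_x + 1 sigma points and weights for N(mu, R)."""
    n = len(mu)
    L = cholesky([[(n + kappa) * R[i][j] for j in range(n)] for i in range(n)])
    cols = [[L[r][c] for r in range(n)] for c in range(n)]   # columns of sqrt
    points = [list(mu)]
    points += [[m + d for m, d in zip(mu, col)] for col in cols]
    points += [[m - d for m, d in zip(mu, col)] for col in cols]
    weights = [kappa / (n + kappa)] + [1.0 / (2.0 * (n + kappa))] * (2 * n)
    return points, weights

def weighted_moments(points, weights):
    """Weighted mean and covariance of a point set."""
    n = len(points[0])
    mean = [sum(w * p[k] for w, p in zip(weights, points)) for k in range(n)]
    cov = [[sum(w * (p[j] - mean[j]) * (p[k] - mean[k])
                for w, p in zip(weights, points))
            for k in range(n)] for j in range(n)]
    return mean, cov

mu = [1.0, -2.0]                       # hypothetical prior mean
R = [[4.0, 1.2], [1.2, 1.0]]           # hypothetical prior covariance
points, weights = gaussian_sigma_points(mu, R, kappa=1.0)
print(weighted_moments(points, weights))
```

The weighted moments of the generated set should reproduce the prior mean and covariance, which is the defining property of the σ-points; steps 2 through 4 then apply the same weighted sums to the transformed points.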


There is a wealth of properties of this processor:

1. The transformed statistics of $y$ are captured precisely up to the second order.

2. The σ-points capture the identical mean and covariance regardless of the choice of matrix square root method.

3. The posterior mean and covariance are calculated using standard linear algebraic methods (WSLR) and it is not necessary to evaluate any Jacobian as required by the XBP methods.

4. $\kappa$ is a "tuning" parameter used to tune the higher order moments of the approximation that can be used to reduce overall prediction errors. For $x$ a multivariate Gaussian, $N_x + \kappa = 3$ is a useful heuristic.

5. A modified form for $\kappa \to \lambda = \alpha^2(N_x + \kappa) - N_x$ (scaled transform) can be used to overcome a nonpositive definiteness of the covariance. Here $\alpha$ controls the spread of the σ-points around $\mu_x$ and is typically set to a value $0.01 < \alpha < 1$, with $\kappa$ a secondary scaling parameter set to 0 or $3 - N_x$, and $\beta$ is an extra degree of freedom to incorporate any extra prior knowledge of the $p_X(x)$, with $\beta = 2$ for Gaussian distributions. In this case the weights change and are given by: $W_0^{(m)} = \frac{\lambda}{N_x+\lambda}$, $W_0^{(c)} = \frac{\lambda}{N_x+\lambda} + (1 - \alpha^2 + \beta)$, and $W_i^{(m)} = W_i^{(c)} = \frac{1}{2(N_x+\lambda)}$ [17, 21, 24].

6. Note that although statistical linearization offers a convenient way to interpret the subsequent sigma-point approach, it does not indicate some of its major advantages, especially since it is possible to extend the approach to incorporate more points capturing and accurately propagating higher order moments [24].
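The scaled transform of item 5 can be sketched as follows. This is a hedged illustration that assumes the commonly published scaled unscented-transform weights; the function name `scaled_weights` and the default parameter values are hypothetical choices, not taken from the text.

```python
def scaled_weights(n_x, alpha=0.5, kappa=0.0, beta=2.0):
    """Return lambda plus the mean and covariance weights of the scaled transform."""
    lam = alpha ** 2 * (n_x + kappa) - n_x          # lambda = alpha^2 (N_x + kappa) - N_x
    w0_mean = lam / (n_x + lam)                     # weight on the central point (mean)
    w0_cov = w0_mean + (1.0 - alpha ** 2 + beta)    # weight on the central point (covariance)
    w_i = 1.0 / (2.0 * (n_x + lam))                 # shared weight of the 2*N_x outer points
    mean_weights = [w0_mean] + [w_i] * (2 * n_x)
    cov_weights = [w0_cov] + [w_i] * (2 * n_x)
    return lam, mean_weights, cov_weights

lam, w_mean, w_cov = scaled_weights(n_x=4)
```

The mean weights always sum to one, even though the central weight can be negative for small $\alpha$; only the central covariance weight differs, by the term $(1 - \alpha^2 + \beta)$.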

Next we apply the σ-point approach to the nonlinear filtering problem by defining
the terms in the SPT and showing where the statistical approximations are utilized.


6.3 SIGMA-POINT BAYESIAN PROCESSOR 
(UNSCENTED KALMAN FILTER) 

The SPBP or UKF is a recursive processor developed to eliminate some of the deficiencies created by the failure of the first-order (Taylor series) linearization process in solving the state estimation problem. Unlike the XBP (EKF), the σ-point processor does not approximate the nonlinear process and measurement models; it employs the true nonlinear models and approximates the underlying Gaussian distribution function of the state variable using a statistical linearization approach, leading to a set of regression equations for the states and measurements. In the sigma-point processor, the state is still represented as Gaussian, but it is specified using the minimal set of deterministically selected samples or σ-points. These points completely capture the true mean and covariance of the prior Gaussian distribution. When they are propagated through the nonlinear process, the posterior mean and covariance are accurately captured to the second order for any nonlinearity, with errors only introduced in the third- and higher order moments. This is the statistical linearization using the weighted (statistical) linear regression approximation (WSLR) discussed previously [16].

We use the XBP (EKF) formulation and its underlying statistics as our prior distribution specified by the following nonlinear model with the conditional Gaussian distributions. Recall that the original discrete nonlinear process model is given by

$$x(t) = a[x(t-1)] + b[u(t-1)] + w(t-1) \tag{6.29}$$

with corresponding measurement model

$$y(t) = c[x(t)] + v(t) \tag{6.30}$$

for $w \sim \mathcal{N}(0, R_{ww})$ and $v \sim \mathcal{N}(0, R_{vv})$. It was demonstrated previously (Sec. 5.3) that the critical conditional Gaussian distribution for the state variable statistics was the prior

$$\Pr(x(t)|Y_{t-1}) = \mathcal{N}(\hat{x}(t|t-1), \tilde{P}(t|t-1))$$

with the measurement statistics specified by

$$\Pr(y(t)|Y_{t-1}) = \mathcal{N}(\hat{y}(t|t-1), R_{\xi\xi}(t|t-1))$$

where $\hat{x}(t|t-1)$ and $\tilde{P}(t|t-1)$ are the respective predicted state and error covariance based upon the data up to time $(t-1)$, and $\hat{y}(t|t-1)$, $R_{\xi\xi}(t|t-1)$ are the predicted measurement and residual covariance. The idea is to use the "prior" statistics and perform the SPT (under Gaussian assumptions) using both the process and measurement nonlinear transformations (models) as specified above, yielding the corresponding set of σ-points in the new space. The predicted means are weighted sums of the transformed σ-points and the covariances are merely weighted sums of their mean-corrected outer products, that is, they are specifically the WSLR discussed in Sec. 6.2.

To develop the sigma-point processor we must:

• PREDICT the next state and error covariance, $(\hat{x}(t|t-1), \tilde{P}(t|t-1))$, by SPT transforming the prior, $\mathcal{N}(\hat{x}(t-1|t-1), \tilde{P}(t-1|t-1))$, including the process noise using the σ-points, $\mathcal{X}_i(t|t-1)$ and $\mathcal{X}_i(t-1|t-1)$, respectively;

• PREDICT the measurement and residual covariance, $[\hat{y}(t|t-1), R_{\xi\xi}(t|t-1)]$, by using the SPT transformed σ-points $\mathcal{Y}_i(t|t-1)$ and performing the weighted regressions; and

• PREDICT the cross-covariance, $R_{\tilde{\mathcal{X}}\xi}(t|t-1)$, in order to calculate the corresponding gain for the subsequent update step.

We use these steps as our road map to develop the sigma-point processor depicted in the block diagram of Fig. 6.3. We start by defining the set, $\Sigma$, and selecting the corresponding $N_\sigma = 2N_x + 1$ σ-points and weights according to the multivariate Gaussian distribution selection procedure developed in the previous section. That is, the $N_\sigma$ points are defined by substituting the σ-points for the SPT transformation of the "prior" state information specified by $\mu_x = \hat{x}(t-1|t-1)$ and $R_{xx} = \tilde{P}(t-1|t-1)$. We define the set of σ-points and weights as:

$$\mathcal{X}_0 = \mu_x = \hat{x}(t-1|t-1)$$

$$\mathcal{X}_i = \mu_x + \left(\sqrt{(N_x+\kappa)R_{xx}}\right)_i = \hat{x}(t-1|t-1) + \left(\sqrt{(N_x+\kappa)\tilde{P}(t-1|t-1)}\right)_i$$

$$\mathcal{X}_{i+N_x} = \mu_x - \left(\sqrt{(N_x+\kappa)R_{xx}}\right)_i = \hat{x}(t-1|t-1) - \left(\sqrt{(N_x+\kappa)\tilde{P}(t-1|t-1)}\right)_i$$

Next we perform the state prediction-step to obtain $\{\mathcal{X}_i(t|t-1), \hat{x}(t|t-1)\}$, transforming each σ-point using the nonlinear process model to the new space, that is, to obtain the one-step state prediction as

$$\mathcal{X}_i(t|t-1) = a[\mathcal{X}_i(t-1|t-1)] + b[u(t-1)] \tag{6.31}$$

The WSLR approximation step to obtain the predicted mean follows from the statistical linearization transformation model using the following relations: $y \to x(t)$, $x \to x(t-1)$, $A \to A(t-1)$, $b \to b(t-1)$ and $\epsilon \to \epsilon(t-1)$ of Eq. 6.4

$$x(t) = \underbrace{A(t-1)x(t-1) + b(t-1)}_{\text{Linear Approximation}} + \underbrace{\epsilon(t-1)}_{\text{Linearization Error}} \tag{6.32}$$

Conditioning Eq. 6.32 on $Y_{t-1}$ and taking expectations, we have

$$\hat{x}(t|t-1) = E\{x(t)|Y_{t-1}\} = E\{A(t-1)x(t-1)|Y_{t-1}\} + E\{b(t-1)|Y_{t-1}\} + E\{\epsilon(t-1)|Y_{t-1}\}$$

or

$$\hat{x}(t|t-1) = A(t-1)\hat{x}(t-1|t-1) + b(t-1) \tag{6.33}$$

Using this linearized form, with the conditional means replacing the unconditional, that is, $\mu_y \to \hat{x}(t|t-1)$ and $\mu_x \to \hat{x}(t-1|t-1)$, we have from Eq. 6.8 that

$$b(t-1) \to \hat{x}(t|t-1) - A(t-1)\hat{x}(t-1|t-1) = \sum_{i=0}^{2N_x} W_i\,\mathcal{X}_i(t|t-1) - A(t-1)\hat{x}(t-1|t-1) \tag{6.34}$$

where we substituted the regression of Eq. 6.14 for $\mu_y$. Substituting this expression for $b$ above gives the required regression relation

$$\hat{x}(t|t-1) = A(t-1)\hat{x}(t-1|t-1) + \sum_{i=0}^{2N_x} W_i\,\mathcal{X}_i(t|t-1) - A(t-1)\hat{x}(t-1|t-1) = \sum_{i=0}^{2N_x} W_i\,\mathcal{X}_i(t|t-1) \tag{6.35}$$

yielding the predicted state estimate using the WSLR approach shown in Fig. 6.3. Next let us define the predicted state estimation error as

$$\tilde{\mathcal{X}}_i(t|t-1) := \mathcal{X}_i(t|t-1) - \hat{x}(t|t-1) \tag{6.36}$$

therefore, the corresponding predicted state error covariance,

$$\tilde{P}(t|t-1) := \mathrm{cov}[\tilde{\mathcal{X}}_i(t|t-1)] = \mathrm{cov}[\mathcal{X}_i(t|t-1) - \hat{x}(t|t-1)]$$

can be calculated from

$$\tilde{P}(t|t-1) = \mathrm{cov}\big[(A(t-1)\mathcal{X}_i(t-1|t-1) + b(t-1) + \epsilon_i(t-1)) - (A(t-1)\hat{x}(t-1|t-1) + b(t-1))\big] = \mathrm{cov}\big[A(t-1)\tilde{\mathcal{X}}_i(t-1|t-1) + \epsilon_i(t-1)\big]$$

Performing this calculation gives the expression

$$\tilde{P}(t|t-1) = A(t-1)\tilde{P}(t-1|t-1)A'(t-1) + R_{\epsilon\epsilon}(t-1) \tag{6.37}$$

which can also be written in the regression form by using the relations from Eq. 6.11 with $y \to \tilde{\mathcal{X}}_i$, $R_{yy} \to \tilde{P}_{\mathcal{X}\mathcal{X}}$. Therefore, substituting for $APA'$ above, we have

$$\tilde{P}(t|t-1) = \tilde{P}_{\mathcal{X}\mathcal{X}}(t|t-1) - R_{\epsilon\epsilon}(t-1) + R_{\epsilon\epsilon}(t-1) = \sum_{i=0}^{2N_x} W_i\,\tilde{\mathcal{X}}_i(t|t-1)\tilde{\mathcal{X}}_i'(t|t-1) \tag{6.38}$$

In the case of additive, zero-mean, white Gaussian noise with covariance, $R_{ww}(t-1)$, we have the final WSLR

$$\tilde{P}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\tilde{\mathcal{X}}_i(t|t-1)\tilde{\mathcal{X}}_i'(t|t-1) + R_{ww}(t-1) \tag{6.39}$$

completing the state error prediction step of Fig. 6.3.
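The state prediction relations of Eqs. 6.31, 6.35, 6.36 and 6.39 can be sketched for a scalar state. This is an illustrative sketch under stated assumptions: the function name `predict` is hypothetical, the deterministic input term $b[u(t-1)]$ is omitted, and the weights are the standard unscented-transform weights assumed from the previous section.

```python
import math

def predict(a, x_hat, p_tilde, r_ww, kappa=2.0):
    """One scalar sigma-point prediction step (N_x = 1, no input term)."""
    n_x = 1
    spread = math.sqrt((n_x + kappa) * p_tilde)
    sigma = [x_hat, x_hat + spread, x_hat - spread]            # prior sigma-points
    weights = [kappa / (n_x + kappa)] + [0.5 / (n_x + kappa)] * 2
    propagated = [a(x) for x in sigma]                         # Eq. 6.31
    x_pred = sum(w * x for w, x in zip(weights, propagated))   # Eq. 6.35
    errors = [x - x_pred for x in propagated]                  # Eq. 6.36
    p_pred = sum(w * e * e for w, e in zip(weights, errors)) + r_ww   # Eq. 6.39
    return propagated, weights, x_pred, p_pred

# Linear sanity check: for a(x) = 0.5 x + 1 the result must equal the Kalman prediction
sigma_pred, weights, x_pred, p_pred = predict(
    lambda x: 0.5 * x + 1.0, x_hat=2.0, p_tilde=0.04, r_ww=0.01)
```

For a linear model the regression collapses to $A\hat{x} + b$ and $A\tilde{P}A' + R_{ww}$, which is a convenient way to check an implementation before applying it to a nonlinear model.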

Here, the Bayesian processor approximates the predicted density, $\mathcal{N}(\hat{x}(t|t-1), \tilde{P}(t|t-1))$, where

$$p_X(x(t)|Y_{t-1}) = \int p_X(x(t)|x(t-1))\,p_X(x(t-1)|Y_{t-1})\,dx(t-1) \tag{6.40}$$

Next we calculate a new set of σ-points to reflect the prediction-step and perform the SPT in the measurement space as

$$\hat{\mathcal{X}}_i(t|t-1) = \left\{\mathcal{X}_i(t|t-1),\ \mathcal{X}_i(t|t-1) + \kappa\sqrt{R_{ww}(t-1)},\ \mathcal{X}_i(t|t-1) - \kappa\sqrt{R_{ww}(t-1)}\right\} \tag{6.41}$$

We linearize the measurement function similar to the nonlinear process function in the previous steps. We use the nonlinear measurement model to propagate the "new" σ-points to the transformed space, producing the measurement prediction step shown in Fig. 6.3.

$$\mathcal{Y}_i(t|t-1) = c[\hat{\mathcal{X}}_i(t|t-1)] \tag{6.42}$$

The WSLR is then performed to obtain the predicted measurement, that is,

$$y(t) = C(t)x(t) + b(t) + \epsilon(t) \tag{6.43}$$

Conditioning on $Y_{t-1}$, as before, and taking expectations of Eq. 6.43 gives

$$\hat{y}(t|t-1) = E\{y(t)|Y_{t-1}\} = C(t)\hat{x}(t|t-1) + b(t) \tag{6.44}$$

Using this linearized form with the conditional means replacing the unconditional, that is, $\mu_y \to \hat{y}(t|t-1)$, $A \to C$, $b \to b$ and $\mu_x \to \hat{x}(t|t-1)$, we have from Eq. 6.8 that

$$b(t) \to \hat{y}(t|t-1) - C(t)\hat{x}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\mathcal{Y}_i(t|t-1) - C(t)\hat{x}(t|t-1) \tag{6.45}$$

where we substituted the regression of Eq. 6.14 for $\mu_y$. Substituting this expression for $b$ above gives the WSLR relation

$$\hat{y}(t|t-1) = C(t)\hat{x}(t|t-1) + \sum_{i=0}^{2N_x} W_i\,\mathcal{Y}_i(t|t-1) - C(t)\hat{x}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\mathcal{Y}_i(t|t-1) \tag{6.46}$$


yielding the predicted measurement estimate of Fig. 6.3.

Similarly, the residual (measurement) error is defined by:

$$\xi_i(t|t-1) := \mathcal{Y}_i(t|t-1) - \hat{y}(t|t-1) \tag{6.47}$$

and therefore the residual (predicted) covariance can be expressed as

$$R_{\xi\xi}(t|t-1) = \mathrm{cov}[\xi_i(t|t-1)] = \mathrm{cov}[\mathcal{Y}_i(t|t-1) - \hat{y}(t|t-1)]$$

Substituting the linearized model above, we obtain

$$R_{\xi\xi}(t|t-1) = \mathrm{cov}\big[(C(t)\mathcal{X}_i(t|t-1) + b(t) + \epsilon_i(t)) - (C(t)\hat{x}(t|t-1) + b(t))\big] = \mathrm{cov}\big[C(t)\tilde{\mathcal{X}}_i(t|t-1) + \epsilon_i(t)\big]$$

Performing this calculation gives the expression

$$R_{\xi\xi}(t|t-1) = C(t)\tilde{P}(t|t-1)C'(t) + R_{\epsilon\epsilon}(t) \tag{6.48}$$

which can also be written in the regression form by using the relations from Eq. 6.11 with $y \to \xi_i$, $R_{yy} \to \tilde{P}_{\xi\xi}$. Therefore, substituting for $CPC'$ above, we have

$$R_{\xi\xi}(t|t-1) = \tilde{P}_{\xi\xi}(t|t-1) - R_{\epsilon\epsilon}(t) + R_{\epsilon\epsilon}(t) = \sum_{i=0}^{2N_x} W_i\,\xi_i(t|t-1)\xi_i'(t|t-1) \tag{6.49}$$

In the case of additive, zero-mean, white Gaussian noise with covariance, $R_{vv}(t)$, we have the final WSLR

$$R_{\xi\xi}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\xi_i(t|t-1)\xi_i'(t|t-1) + R_{vv}(t) \tag{6.50}$$


completing the measurement residual prediction step of Fig. 6.3.

The gain is estimated from

$$\mathcal{K}(t) = R_{\tilde{\mathcal{X}}\xi}(t|t-1)\,R_{\xi\xi}^{-1}(t|t-1) \tag{6.51}$$

where the cross error covariance is approximated using the WSLR as (above)

$$R_{\tilde{\mathcal{X}}\xi}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\tilde{\mathcal{X}}_i(t|t-1)\xi_i'(t|t-1) \tag{6.52}$$

With this in mind it is clear that the posterior distribution can be calculated from

$$p_X(x(t)|Y_t) \sim \mathcal{N}(\hat{x}(t|t), \tilde{P}(t|t)) = \int p_X(x(t)|x(t-1)) \times p_X(x(t-1)|Y_{t-1})\,dx(t-1) \tag{6.53}$$

where we have the "usual" (Kalman) update relations of Fig. 6.3 starting with the innovations

$$e(t) = y(t) - \hat{y}(t|t-1) \tag{6.54}$$

and the state update

$$\hat{x}(t|t) = \hat{x}(t|t-1) + \mathcal{K}(t)e(t) \tag{6.55}$$

along with the corresponding state error covariance update given by

$$\tilde{P}(t|t) = \tilde{P}(t|t-1) - \mathcal{K}(t)R_{\xi\xi}(t|t-1)\mathcal{K}'(t) \tag{6.56}$$
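The measurement prediction, gain and update relations (Eqs. 6.42, 6.46, 6.47, 6.50 to 6.52 and 6.54 to 6.56) can be sketched for a scalar state and measurement. This is a hedged sketch: the function name `update` is hypothetical, and for simplicity it reuses the predicted σ-points directly rather than forming the augmented set of Eq. 6.41, which is an assumption that is exact only when the process noise has already been accounted for in `p_pred`.

```python
def update(c, sigma_pred, weights, x_pred, p_pred, y, r_vv):
    """One scalar sigma-point measurement update."""
    y_sigma = [c(x) for x in sigma_pred]                             # Eq. 6.42
    y_pred = sum(w * yi for w, yi in zip(weights, y_sigma))          # Eq. 6.46
    xi = [yi - y_pred for yi in y_sigma]                             # Eq. 6.47
    x_err = [x - x_pred for x in sigma_pred]                         # Eq. 6.36
    r_xixi = sum(w * r * r for w, r in zip(weights, xi)) + r_vv      # Eq. 6.50
    r_xxi = sum(w * e * r for w, e, r in zip(weights, x_err, xi))    # Eq. 6.52
    gain = r_xxi / r_xixi                                            # Eq. 6.51
    innovation = y - y_pred                                          # Eq. 6.54
    x_upd = x_pred + gain * innovation                               # Eq. 6.55
    p_upd = p_pred - gain * r_xixi * gain                            # Eq. 6.56
    return x_upd, p_upd, gain, innovation

# Linear sanity check with c(x) = 2x, predicted mean 1.0 and predicted variance 0.5
spread = (3.0 * 0.5) ** 0.5
x_upd, p_upd, gain, innovation = update(
    lambda x: 2.0 * x,
    [1.0, 1.0 + spread, 1.0 - spread],
    [2.0 / 3.0, 1.0 / 6.0, 1.0 / 6.0],
    x_pred=1.0, p_pred=0.5, y=2.5, r_vv=0.5)
```

For the linear check the gain reduces to the Kalman gain $PC'/(CPC' + R_{vv})$, so the result can be verified by hand.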

This completes the SPBP algorithm, which is summarized in Table 6.1. We have shown how the SPT coupled with the WSLR can be used to develop this technique in its "plain vanilla" form. We note in passing that there are no Jacobians calculated and the nonlinear models are employed directly to transform the σ-points to the new space. Also, in the original problem definition (see Sec. 6.2) both process and measurement noise sources, $(w, v)$, were assumed additive. The SPT enables us to "generalize" the noise terms to also be injected in a nonlinear manner (e.g., multiplication). Thus, the noise is not treated separately, but can be embedded into the problem by defining an augmented state vector, say $\bar{x}(t) = [x(t)\ \ w(t-1)\ \ v(t)]'$. We chose to ignore the general case to keep the development of the sigma-point processor simple. For more details of the
general process see the following references [17, 21, 24]. Let us apply the sigma-point processor to the nonlinear trajectory estimation problem and compare its performance to the other nonlinear processors discussed previously.

TABLE 6.1 Discrete Sigma-Point Bayesian Processor (Unscented Kalman Filter) Algorithm

State: Sigma Points and Weights
  $\mathcal{X}_0 = \hat{x}(t-1|t-1)$
  $\mathcal{X}_i = \hat{x}(t-1|t-1) + \left(\sqrt{(N_x+\kappa)\tilde{P}(t-1|t-1)}\right)_i$
  $\mathcal{X}_{i+N_x} = \hat{x}(t-1|t-1) - \left(\sqrt{(N_x+\kappa)\tilde{P}(t-1|t-1)}\right)_i$

State Prediction
  $\mathcal{X}_i(t|t-1) = a[\mathcal{X}_i(t-1|t-1)] + b[u(t-1)]$    (nonlinear state process)
  $\hat{x}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\mathcal{X}_i(t|t-1)$    (state regression)

State Error Prediction
  $\tilde{\mathcal{X}}_i(t|t-1) = \mathcal{X}_i(t|t-1) - \hat{x}(t|t-1)$    (state error)
  $\tilde{P}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\tilde{\mathcal{X}}_i(t|t-1)\tilde{\mathcal{X}}_i'(t|t-1) + R_{ww}(t-1)$    (error covariance prediction)

Measurement: Sigma Points and Weights
  $\hat{\mathcal{X}}_i(t|t-1) = \{\mathcal{X}_i(t|t-1),\ \mathcal{X}_i(t|t-1) + \kappa\sqrt{R_{ww}(t-1)},\ \mathcal{X}_i(t|t-1) - \kappa\sqrt{R_{ww}(t-1)}\}$

Measurement Prediction
  $\mathcal{Y}_i(t|t-1) = c[\hat{\mathcal{X}}_i(t|t-1)]$    (nonlinear measurement)
  $\hat{y}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\mathcal{Y}_i(t|t-1)$    (measurement regression)

Residual Prediction
  $\xi_i(t|t-1) = \mathcal{Y}_i(t|t-1) - \hat{y}(t|t-1)$    (predicted residual)
  $R_{\xi\xi}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\xi_i(t|t-1)\xi_i'(t|t-1) + R_{vv}(t)$    (residual covariance regression)

Gain
  $R_{\tilde{\mathcal{X}}\xi}(t|t-1) = \sum_{i=0}^{2N_x} W_i\,\tilde{\mathcal{X}}_i(t|t-1)\xi_i'(t|t-1)$    (cross-covariance regression)
  $\mathcal{K}(t) = R_{\tilde{\mathcal{X}}\xi}(t|t-1)\,R_{\xi\xi}^{-1}(t|t-1)$    (gain)

State Update
  $e(t) = y(t) - \hat{y}(t|t-1)$    (innovation)
  $\hat{x}(t|t) = \hat{x}(t|t-1) + \mathcal{K}(t)e(t)$    (state update)
  $\tilde{P}(t|t) = \tilde{P}(t|t-1) - \mathcal{K}(t)R_{\xi\xi}(t|t-1)\mathcal{K}'(t)$    (error covariance update)

Initial Conditions
  $\hat{x}(0|0)$    $\tilde{P}(0|0)$

Example 6.3 

We revisit the nonlinear trajectory estimation problem of the previous examples with the dynamics specified by the discrete nonlinear process given by

$$x(t) = (1 - 0.05\Delta T)\,x(t-1) + 0.04\Delta T\,x^2(t-1) + w(t-1)$$

and corresponding measurement model

$$y(t) = x^2(t) + x^3(t) + v(t)$$

Recall that $v(t) \sim \mathcal{N}(0, 0.09)$, $x(0) = 2.0$, $P(0) = 0.01$, $\Delta T = 0.01$ sec and $R_{ww} = 0$. The simulated measurement is shown in Fig. 6.4b. The sigma-point processor (SPBP), the XBP (EKF) and the IX-BP (IEKF) (3 iterations) were applied to this problem. We used the square-root implementations of the XBP and IX-BP in SSPACK_PC [10] and compared them to the sigma-point processor in REBEL [24]. The results are shown in Fig. 6.4, where we see the corresponding trajectory (state) estimates in a and the "filtered" measurements in b. From the figures, it appears that all of the estimates are quite reasonable, with the sigma-point processor estimate (thick solid line) converging most rapidly to the true trajectory (dashed line). The XBP (thick dotted line) appears slightly biased while the IX-BP (thin dotted line) converges rapidly, but then wanders slightly from the truth. The filtered measurements indicate similar performance. The zero-mean/whiteness tests confirm these observations. The XBP and IX-BP perform similarly with respective zero-mean/whiteness values of ($1.04 \times 10^{-1} < 1.73 \times 10^{-1}$/1% out) and ($3.85 \times 10^{-2} < 1.73 \times 10^{-1}$/0% out), while the sigma-point processor is certainly comparable at ($5.63 \times 10^{-2} < 1.73 \times 10^{-1}$/0% out). This completes the example. △△△

FIGURE 6.4 Nonlinear trajectory estimation. (a) Trajectory (state) estimates using the XBP (thick dotted), IX-BP (thin dotted) and SPBP (thick solid). (b) Filtered measurement estimates using the XBP (thick dotted), IX-BP (thin dotted) and SPBP (thick solid). (c) Zero-mean/whiteness tests for XBP ($1.04 \times 10^{-1} < 1.73 \times 10^{-1}$/1% out). (d) Zero-mean/whiteness tests for IX-BP ($3.85 \times 10^{-2} < 1.73 \times 10^{-1}$/0% out). (e) Zero-mean/whiteness tests for SPBP ($5.63 \times 10^{-2} < 1.73 \times 10^{-1}$/0% out).
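The processor of Table 6.1 can be exercised on the model of Example 6.3 with a short scalar simulation. This is only a sketch and does not reproduce the SSPACK_PC or REBEL runs reported above: the random seed, the number of steps, the choice $\kappa = 2$, the initial estimate $\hat{x}(0|0) = x(0)$, and the reuse of the predicted σ-points in the measurement step (possible here because $R_{ww} = 0$) are all assumptions made for illustration.

```python
import math
import random

DT = 0.01      # sampling interval (sec), from the example
R_VV = 0.09    # measurement noise variance, from the example
R_WW = 0.0     # process noise variance, from the example
KAPPA = 2.0    # assumed: N_x + kappa = 3 heuristic with N_x = 1

def a(x):
    """Nonlinear process model of Example 6.3 (noise-free part)."""
    return (1.0 - 0.05 * DT) * x + 0.04 * DT * x * x

def c(x):
    """Nonlinear measurement model of Example 6.3 (noise-free part)."""
    return x ** 2 + x ** 3

def sigma_points(mean, var):
    spread = math.sqrt((1.0 + KAPPA) * var)
    weights = [KAPPA / (1.0 + KAPPA), 0.5 / (1.0 + KAPPA), 0.5 / (1.0 + KAPPA)]
    return [mean, mean + spread, mean - spread], weights

def spbp_step(x_hat, p, y):
    """One predict/update cycle following the rows of Table 6.1."""
    pts, w = sigma_points(x_hat, p)
    x_sig = [a(x) for x in pts]                                      # state prediction
    x_pred = sum(wi * xi for wi, xi in zip(w, x_sig))                # state regression
    x_err = [xi - x_pred for xi in x_sig]                            # state error
    p_pred = sum(wi * e * e for wi, e in zip(w, x_err)) + R_WW       # error covariance
    y_sig = [c(xi) for xi in x_sig]                                  # measurement prediction
    y_pred = sum(wi * yi for wi, yi in zip(w, y_sig))                # measurement regression
    res = [yi - y_pred for yi in y_sig]                              # predicted residual
    r_res = sum(wi * r * r for wi, r in zip(w, res)) + R_VV          # residual covariance
    r_cross = sum(wi * e * r for wi, e, r in zip(w, x_err, res))     # cross-covariance
    gain = r_cross / r_res
    return x_pred + gain * (y - y_pred), p_pred - gain * r_res * gain

rng = random.Random(0)                 # fixed seed so the run is repeatable
x_true, x_hat, p = 2.0, 2.0, 0.01      # x(0) = 2.0, P(0) = 0.01
for _ in range(150):
    x_true = a(x_true)
    y = c(x_true) + rng.gauss(0.0, math.sqrt(R_VV))
    x_hat, p = spbp_step(x_hat, p, y)
```

After the loop, `x_hat` and `p` hold the final state estimate and error variance; comparing `x_hat` with `x_true` gives a quick check that the processor tracks the trajectory.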


6.3.1 Extensions of the Sigma-Point Processor 

Recently there have been a number of developments in the nonlinear estimation area that are based on the sigma-point (or similar) transformation [29, 30, 32]. Next we briefly mention these approaches, keeping in mind that they are members of the "sigma-point family".

The central difference Bayesian processor (CDBP) or central difference Kalman filter (CDKF) is based on the Stirling interpolation formula, which is essentially a second order Taylor series expansion of the nonlinear random function about its mean. The central differences are used to approximate the first and second order terms of the series. In this sense the processor implicitly employs the WSLR used to derive the SPBP as before. The resulting σ-points are dependent on the half-step size, $\Delta x$, rather than the other parameters discussed previously. The CDBP is slightly more accurate than the SPBP, and also has the advantage of only requiring a single parameter $\Delta x$ to adjust the spread of the σ-points. For more details, see [29].

Just as with the Kalman filter implementations [10], the SPBP also admits the numerically stable "square-root" forms for the predicted and updated state covariance matrices. These methods are based on employing the QR decomposition and Cholesky updating. Again this approach offers a slightly more accurate processor as well as reduced computational costs while maintaining numerical stability. See [29] for further details.


6.4 QUADRATURE BAYESIAN PROCESSORS 

The grid-based quadrature Bayesian processor (QBP) or equivalently quadrature Kalman filter (QKF) is another alternative σ-point approach [30-32]. It uses the Gauss-Hermite numerical integration rule as its basic building block to precisely calculate the sequential Bayesian recursions of Chapter 2 under the Gaussian assumptions. For scalars, the Gauss-Hermite rule replaces a Gaussian-weighted integral of a function by a weighted sum of $M$ function evaluations, where equality holds for all polynomials, $f(\cdot)$, of degree up to $2M - 1$ and the quadrature (grid) points $\Delta x_i$ with corresponding weights $W_i$ are determined according to the rule. The QBP uses the WSLR to linearize the nonlinear transformations through a set of Gauss-Hermite quadrature points rather than introducing the set of σ-points. Details of this approach can be found in [32]. We briefly show the approach following the usual recursive forms of the BP of the previous chapter. For our nonlinear estimation problem, we have both nonlinear process ($a[\cdot]$) and measurement ($c[\cdot]$) equations that can be expressed in terms of the Gauss-Hermite rule above.
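The polynomial exactness property of the rule can be illustrated with the $M = 3$ point case. This sketch assumes the rule is normalized against the standard Gaussian density, so that the weights sum to one; under that assumption the three points are $0, \pm\sqrt{3}$ with weights $2/3, 1/6, 1/6$. The function name is hypothetical.

```python
import math

# Three-point Gauss-Hermite rule for a standard Gaussian weight (assumed normalization)
GH_POINTS = [-math.sqrt(3.0), 0.0, math.sqrt(3.0)]
GH_WEIGHTS = [1.0 / 6.0, 2.0 / 3.0, 1.0 / 6.0]

def gauss_hermite_expectation(f, mean=0.0, var=1.0):
    """Approximate E{f(x)} for x ~ N(mean, var) with the M = 3 rule."""
    root = math.sqrt(var)   # scalar "matrix square root"
    return sum(w * f(root * dx + mean) for w, dx in zip(GH_WEIGHTS, GH_POINTS))

fourth_moment = gauss_hermite_expectation(lambda x: x ** 4)   # degree 4 <= 2M - 1 = 5
sixth_moment = gauss_hermite_expectation(lambda x: x ** 6)    # degree 6 > 5: not exact
```

The fourth moment of a standard Gaussian is 3 and is recovered exactly, while the sixth moment (true value 15) is not, consistent with the degree $2M - 1$ limit. For $M = 3$ the grid coincides with the scalar σ-points obtained with $N_x + \kappa = 3$.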

Suppose we have the conditional mean and covariance at time $(t-1)$ based on all of the data up to the same time step. Then the corresponding conditional distribution is $\Pr(x(t-1)|Y_{t-1}) \sim \mathcal{N}(\hat{x}(t-1|t-1), \tilde{P}(t-1|t-1))$. The QBP is then given by the following set of recursive Bayesian equations:

$$\mathcal{X}_i = \sqrt{\tilde{P}(t-1|t-1)}\,\Delta x_i + \hat{x}(t-1|t-1) \tag{6.58}$$

where $\sqrt{\tilde{P}}$ is the matrix square root (Cholesky factor: $\tilde{P} = \sqrt{\tilde{P}}\sqrt{\tilde{P}}'$). The prediction-step, $\Pr(x(t)|Y_{t-1}) \sim \mathcal{N}(\hat{x}(t|t-1), \tilde{P}(t|t-1))$, is given by

$$\hat{x}(t|t-1) = \sum_{i=1}^{M} W_i\,a[\mathcal{X}_i]$$

$$\tilde{P}(t|t-1) = \sum_{i=1}^{M} W_i\,(a[\mathcal{X}_i] - \hat{x}(t|t-1))(a[\mathcal{X}_i] - \hat{x}(t|t-1))' + R_{ww}(t-1) \tag{6.59}$$

The update-step, $\Pr(x(t)|Y_t) \sim \mathcal{N}(\hat{x}(t|t), \tilde{P}(t|t))$, follows a similar development

$$\mathcal{X}_i = \sqrt{\tilde{P}(t|t-1)}\,\Delta x_i + \hat{x}(t|t-1) \tag{6.60}$$

and

$$\hat{x}(t|t) = \hat{x}(t|t-1) + \mathcal{K}(t)(y(t) - \hat{y}(t)) \quad \text{for} \quad \hat{y}(t) = \sum_{i=1}^{M} W_i\,c[\mathcal{X}_i]$$

$$\tilde{P}(t|t) = \tilde{P}(t|t-1) - \mathcal{K}(t)P_{XY}' \tag{6.61}$$

where

$$\mathcal{K}(t) = P_{XY}\,(P_{YY} + R_{vv}(t))^{-1}$$

$$P_{XY} = \sum_{i=1}^{M} W_i\,(\hat{x}(t|t-1) - \mathcal{X}_i)(\hat{y}(t) - c[\mathcal{X}_i])'$$

$$P_{YY} = \sum_{i=1}^{M} W_i\,(\hat{y}(t) - c[\mathcal{X}_i])(\hat{y}(t) - c[\mathcal{X}_i])' \tag{6.62}$$

As with the SPBP, the QBP does not linearize the process or measurement models as in the case of the classical nonlinear processors. It calculates the weighted quadrature points in state-space over a fixed grid to estimate the unknown distribution (see [32-34] for more details).

Another powerful approach to nonlinear estimation is the Gaussian sum (mixture) processor [36], which we discuss in more detail next.


6.5 GAUSSIAN SUM (MIXTURE) BAYESIAN PROCESSORS 

Another general approach to the Bayesian processing problem is the Gaussian sum (G-S) approximation, which leads to the Gaussian sum processor. It has been shown [4, 35, 36] that any non-Gaussian distribution can be approximated by a sum of Gaussian distributions, that is,

$$p_X(x) \approx \sum_{i=1}^{N_g} W_i\,p_i(x) = \sum_{i=1}^{N_g} W_i\,\mathcal{N}(x: \mu_x(i), \Sigma_x(i)) \quad \text{for} \quad \sum_{i=1}^{N_g} W_i = 1 \ \text{and} \ W_i \ge 0\ \forall i \tag{6.63}$$

a mixture of Gaussian distributions with $\{W_i\}$ the set of mixing coefficients or weights. Clearly, this approach can be implemented with a bank of parallel classical processors (LZ-BP, XBP, IX-BP) or with any of the modern SPBP family just discussed in order to estimate the mean and variance of the individual $N_g$ processors. In addition, the particle filtering algorithms to be discussed in the next chapter can also be incorporated into a Gaussian mixture framework, which is what makes this approach important [37]. Before we develop the processor, let us investigate some of the underlying properties of the Gaussian sum or equivalently Gaussian mixture (G-M) that make it intriguing.

The fundamental problem of approximating a probability distribution or density evolves from the idea of delta families of functions, that is, families of functions that converge to a delta or impulse function as the parameter that uniquely characterizes that family converges to a limiting value. Properties of such families are discussed in [35]. The most important result regarding Gaussian sums is given in a theorem, which states that a probability density function formed by

$$p_{X_\sigma}(x) = \int_{-\infty}^{\infty} p(\alpha)\,\delta_\sigma(x - \alpha)\,d\alpha \tag{6.64}$$

converges uniformly to $p_X(x)$ [35]. The Gaussian density forms a delta family with parameter $\sigma$ represented by

$$\delta_\sigma(x) = \mathcal{N}(x: 0, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\exp\left\{-\frac{x^2}{2\sigma^2}\right\} \tag{6.65}$$

which satisfies the requirement of a positive delta family as $\sigma \to 0$, that is, as the variance parameter approaches zero in the limit, the Gaussian density approaches a delta function; therefore, we have that

$$p_{X_\sigma}(x) = \int_{-\infty}^{\infty} p(\alpha)\,\mathcal{N}_\sigma(x - \alpha)\,d\alpha \tag{6.66}$$

This is the key result that forms the underlying basis of Gaussian sum or mixture approximations, similar to the idea of an empirical probability distribution

$$\Pr(x) \approx \sum_{i=1}^{N_x} W_i\,\delta(x - x_i) \iff \Pr(x) \approx \sum_{i=1}^{N_g} W_i\,\mathcal{N}(x: \mu_x(i), \Sigma_x(i)) \tag{6.67}$$

as the variance $\sigma_x \to 0$. The Gaussian converges to an impulse located at its mean, $\mu_x$. The Gaussian mixture distributions have some interesting properties that we list below for the scalar case (see [35, 36] for more details):

• Mean: $\mu_g = \sum_{i=1}^{N_g} W_i\,\mu_x(i)$

• Covariance: $\Sigma_g = \sum_{i=1}^{N_g} W_i\,\big(\Sigma_x(i) + (\mu_x(i) - \mu_g)^2\big)$

• Skewness: $\gamma_g = \sum_{i=1}^{N_g} W_i\,(\mu_x(i) - \mu_g)\big(3\Sigma_x(i) + (\mu_x(i) - \mu_g)^2\big) \times \Sigma_g^{-3/2}$

• Kurtosis: $\kappa_g = \sum_{i=1}^{N_g} W_i\,\big(3\Sigma_x^2(i) + 6(\mu_x(i) - \mu_g)^2\Sigma_x(i) + (\mu_x(i) - \mu_g)^4\big)\,\Sigma_g^{-2} - 3$
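The mixture properties listed above can be evaluated directly for a scalar Gaussian mixture. This is a minimal sketch that assumes the standard central-moment formulas for a scalar mixture (variance including the spread of the component means, skewness normalized by $\Sigma_g^{3/2}$, excess kurtosis normalized by $\Sigma_g^2$); the function name is hypothetical.

```python
def mixture_moments(weights, means, variances):
    """Mean, variance, skewness and excess kurtosis of a scalar Gaussian mixture."""
    mu_g = sum(w * m for w, m in zip(weights, means))
    devs = [m - mu_g for m in means]   # component mean offsets from the mixture mean
    var_g = sum(w * (v + d * d) for w, v, d in zip(weights, variances, devs))
    third = sum(w * d * (3.0 * v + d * d)
                for w, v, d in zip(weights, variances, devs))
    fourth = sum(w * (3.0 * v * v + 6.0 * d * d * v + d ** 4)
                 for w, v, d in zip(weights, variances, devs))
    skewness = third / var_g ** 1.5
    kurtosis = fourth / var_g ** 2 - 3.0
    return mu_g, var_g, skewness, kurtosis

# Symmetric two-component mixture: equal weights, means -1 and +1, unit variances
mu_g, var_g, skew_g, kurt_g = mixture_moments([0.5, 0.5], [-1.0, 1.0], [1.0, 1.0])
```

A single-component "mixture" returns zero skewness and zero excess kurtosis, as expected for a Gaussian, while the symmetric two-component case above is flatter than a Gaussian (negative excess kurtosis).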


Next, let us see how this property can be utilized to develop a Bayesian processor. The processor we seek evolves from the sequential Bayesian paradigm of Sec. 2.5 given by

$$\Pr(x(t)|Y_t) = \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1})}{\Pr(y(t)|Y_{t-1})} \tag{6.68}$$

with the corresponding prediction-step

$$\Pr(x(t)|Y_{t-1}) = \int \Pr(x(t)|x(t-1)) \times \Pr(x(t-1)|Y_{t-1})\,dx(t-1) \tag{6.69}$$

Assume at time $t$ that we approximate the predictive distribution by the Gaussian mixture

$$\Pr(x(t)|Y_{t-1}) \approx \sum_{i=1}^{N_g} W_i(t-1) \times \mathcal{N}(x(t): \bar{x}_i(t), \bar{P}_i(t)) \tag{6.70}$$

Then substituting this expression into the filtering posterior of Eq. 6.68 we obtain

$$\Pr(x(t)|Y_t) = \frac{\Pr(y(t)|x(t))}{\Pr(y(t)|Y_{t-1})} \times \sum_{i=1}^{N_g} W_i(t-1) \times \mathcal{N}(x(t): \bar{x}_i(t), \bar{P}_i(t)) \tag{6.71}$$


For clarity in this development we constrain both process and measurement models to additive Gaussian noise sources², resulting in the nonlinear state-space approximate GM representation of Sec. 4.8, that is, the process model

$$x(t) = a[x(t-1)] + w(t-1), \quad w \sim \mathcal{N}(0, R_{ww}(t)) \tag{6.72}$$

and the measurement model

$$y(t) = c[x(t)] + v(t), \quad v \sim \mathcal{N}(0, R_{vv}(t)) \tag{6.73}$$

with Gaussian prior, $x(0) \sim \mathcal{N}(\bar{x}(0), \bar{P}(0))$. Applying the fundamental convergence theorem of Eq. 6.64, the Gaussian mixture distribution of the posterior can be approximated by

$$\Pr(x(t)|Y_t) \approx \sum_{i=1}^{N_g} W_i(t)\,\mathcal{N}(x(t): \hat{x}_i(t), P_i(t)) \tag{6.74}$$

and converges uniformly in $x(t)$ and $y(t)$ as $P_i(t) \to 0$ for $i = 1, \ldots, N_g$. Therefore, $\{\hat{x}_i(t), P_i(t)\}$ can be estimated from any of the classical (nonlinear) processors developed in Chapter 5 or the modern σ-point processor of this chapter. We choose to use the SPBP technique³ of Table 6.1 to provide a "bank" of $N_g$ parallel processors required to estimate each member of the Gaussian mixture such that

$$\hat{y}_i(t) = f(\bar{x}_i(t))$$

$$e_i(t) = y(t) - \hat{y}_i(t)$$

$$R_{e_i e_i}(t) = g(\bar{P}_i(t), R_{vv}(t))$$

$$\hat{x}_i(t) = \bar{x}_i(t) + \mathcal{K}_i(t)e_i(t)$$

$$P_i(t) = \bar{P}_i(t) - \mathcal{K}_i(t)R_{e_i e_i}(t)\mathcal{K}_i'(t)$$

where $f(\cdot)$, $g(\cdot)$ are functions that are derived from the SPBP of Table 6.1 and the weights⁴ (mixing coefficients) of the individual mixture Gaussians are updated

$$W_i(t) = \frac{W_i(t-1) \times \mathcal{N}(y(t): \hat{y}_i(t), R_{e_i e_i}(t))}{\sum_{i=1}^{N_g} W_i(t-1) \times \mathcal{N}(y(t): \hat{y}_i(t), R_{e_i e_i}(t))} \tag{6.75}$$

² This is not necessary but enables the development and resulting processor relations to be much simpler. In the general case both noise sources can be modeled by Gaussian mixtures as well (see [35] for details).
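The mixing-coefficient update of Eq. 6.75 can be sketched for scalar measurements: each weight is scaled by the Gaussian likelihood of the new measurement under that processor's predicted measurement and innovations variance, and the result is renormalized. The function names and the numbers in the usage line are hypothetical.

```python
import math

def gaussian_pdf(y, mean, var):
    """Scalar Gaussian density N(y: mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def update_weights(prior_weights, y, y_preds, innov_vars):
    """Eq. 6.75: likelihood-weighted, renormalized mixing coefficients."""
    unnormalized = [w * gaussian_pdf(y, y_hat, r)
                    for w, y_hat, r in zip(prior_weights, y_preds, innov_vars)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

# Two processors with equal prior weights; the first predicts the measurement exactly
weights = update_weights([0.5, 0.5], y=1.0, y_preds=[1.0, 3.0], innov_vars=[1.0, 1.0])
```

The processor whose predicted measurement lies closer to the data (in units of its innovations standard deviation) gains weight, and the weights continue to sum to one.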

Once we have performed the parallel SPBP estimation using the Gaussian mixtures, we can estimate the statistic of interest from the posterior: the conditional mean as

$$\hat{x}(t|t) = E\{x(t)|Y_t\} = \int x(t)\Pr(x(t)|Y_t)\,dx(t) \approx \sum_{i=1}^{N_g} W_i(t) \int x(t)\,\mathcal{N}(x(t): \hat{x}_i(t), P_i(t))\,dx(t)$$

Now using the sifting property of the implied delta function, the Gaussian sum converges uniformly ($P_i \to 0$, $\mathcal{N} \to \delta(\cdot)$) to give

$$\hat{x}(t|t) = \sum_{i=1}^{N_g} W_i(t)\,\hat{x}_i(t) \tag{6.76}$$

³ We symbolically use the SPBP algorithm and ignore the details (sigma-points, etc.) of the implementation in this presentation to avoid the unnecessary complexity.

⁴ The detailed derivation of these expressions can be found in Alspach [36] and Anderson [4]. Both references use the uniform convergence theorem to develop the posterior representation given above.


Defining the estimation error as

$$\tilde{x}(t|t) := x(t) - \hat{x}(t|t)$$

and the corresponding error covariance as $\tilde{P}(t|t) := \mathrm{cov}(\tilde{x}(t|t))$ as before in Chapter 5, the approximated state error covariance is given by

$$\tilde{P}(t|t) = \sum_{i=1}^{N_g} W_i(t)\big[P_i(t) + \mathrm{cov}(\hat{x}_i(t) - \hat{x}(t|t))\big] \tag{6.77}$$
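Equations 6.76 and 6.77 combine the outputs of the parallel processors into a single conditional mean and error covariance. The sketch below is a plain-list illustration for vector states; it assumes that the term $\mathrm{cov}(\hat{x}_i(t) - \hat{x}(t|t))$ is evaluated as the outer product of the offset of each component mean from the overall mean, and the function name is hypothetical.

```python
def collapse_mixture(weights, means, covs):
    """Combine component estimates into one mean (Eq. 6.76) and covariance (Eq. 6.77)."""
    n = len(means[0])
    mean = [sum(w * m[k] for w, m in zip(weights, means)) for k in range(n)]
    cov = [[0.0] * n for _ in range(n)]
    for w, m, p in zip(weights, means, covs):
        d = [m[k] - mean[k] for k in range(n)]   # offset of this component mean
        for r in range(n):
            for s in range(n):
                # component covariance plus spread-of-means outer product (assumed form)
                cov[r][s] += w * (p[r][s] + d[r] * d[s])
    return mean, cov

identity = [[1.0, 0.0], [0.0, 1.0]]
mean, cov = collapse_mixture([0.25, 0.75], [[0.0, 0.0], [4.0, 0.0]], [identity, identity])
```

The combined covariance is never smaller than the weighted average of the component covariances, because the spread of the component means adds a nonnegative term.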

Thus, updating consists of a bank of $N_g$ parallel SPBP to estimate the means and covariances $\{\hat{x}_i(t), P_i(t)\}$ required for the conditional statistics of Eqs. 6.76 and 6.77. This set of relations constitutes the update-step of the Gaussian sum processor. Next let us briefly develop the prediction-step.

With the availability of the posterior $\Pr(x(t)|Y_t)$ and the process model of Eq. 6.72, the one-step prediction distribution can also be estimated as a Gaussian mixture. Using the filtering posterior and SPBP relations of Table 6.1, we have that

$$\Pr(x(t+1)|Y_t) \approx \sum_{i=1}^{N_g} W_i(t) \times \mathcal{N}(x(t+1): \bar{x}_i(t+1), \bar{P}_i(t+1)) \tag{6.78}$$

and using the $N_g$-SPBP, we have

$$\bar{x}_i(t+1) = a[\hat{x}_i(t)]$$

$$\bar{P}_i(t+1) = h(P_i(t)) + R_{ww}(t)$$

for $h(\cdot)$ a function of the SPBP parameters in Table 6.1. These relations lead to the one-step prediction conditional mean and covariance (as before)

$$\hat{x}(t+1|t) = \sum_{i=1}^{N_g} W_i(t)\,\bar{x}_i(t+1)$$

$$\tilde{P}(t+1|t) = \sum_{i=1}^{N_g} W_i(t)\big[\bar{P}_i(t+1) + \mathrm{cov}(\bar{x}_i(t+1) - \hat{x}(t+1|t))\big] \tag{6.79}$$

to complete the algorithm.

Next we consider the application of nonlinear processors to a tracking problem.

6.6 CASE STUDY: 2D-TRACKING PROBLEM

In this section we investigate the design of nonlinear BP to solve a two-dimensional (2D) tracking problem. The hypothetical scenario discussed will demonstrate the applicability of these processors to such a problem and the "basic" thinking behind constructing the problem and its solution. In contrast to the "bearings-only" problem discussed in Chapter 5, let us investigate the tracking of a large tanker entering a busy harbor with a prescribed navigation path. In this case the pilot on the vessel must adhere strictly to the path that has been filed with the harbor master (controller). Here we assume that the ship has a transponder frequently signaling accurate information about its current position. The objective is to safely dock the tanker without any incidents. We observe that the ship's path should track the prescribed trajectory (Cartesian coordinates) shown in Fig. 6.5, which corresponds to the instantaneous XY-positions (versus time) shown.

Our fictitious measurement instrument (e.g., low ground clutter phased array radar or a satellite communications receiver) is assumed to instantly report on the tanker position in bearing and range with high accuracy. The measurements are given by

$$\Theta(t) = \arctan\frac{Y(t)}{X(t)} \quad \text{and} \quad R(t) = \sqrt{X^2(t) + Y^2(t)}$$

FIGURE 6.5 Tanker ground track geometry for the harbor docking application: (a) Instantaneous X-position (nm). (b) Instantaneous Y-position (nm). (c) Filed XY-path (nm). (d) Instantaneous bearing (deg). (e) Instantaneous range (nm). (f) Polar bearing-range track from sensor measurement.


We use the usual state-space formulation for a constant velocity model (see Sec. 5.5) with state vector defined in terms of the physical variables as $x(t) := [X(t)\ \ Y(t)\ \ V_x(t)\ \ V_y(t)]'$ along with the incremental velocity input as $u'(t) := [-\Delta V_{x_o}\ \ -\Delta V_{y_o}]$.

Using this information (as before), the entire system can be represented as a Gauss-Markov model with the noise sources representing uncertainties in the states and measurements. Thus we have the equations of motion

$$x(t) = \begin{bmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} x(t-1) + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} -\Delta V_{x_o}(t-1) \\ -\Delta V_{y_o}(t-1) \end{bmatrix} + w(t-1) \tag{6.80}$$

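with the corresponding measurement model given by

$$y(t) = \begin{bmatrix} \arctan\dfrac{x_2(t)}{x_1(t)} \\[2mm] \sqrt{x_1^2(t) + x_2^2(t)} \end{bmatrix} + v(t)$$

for $w \sim \mathcal{N}(0, R_{ww})$ and $v \sim \mathcal{N}(0, R_{vv})$.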
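The constant velocity model can be sketched as a noise-free propagation with an impulsive velocity change. This is an illustration under stated assumptions: the function name is hypothetical, process noise is omitted, the sampling interval of 0.02 hr is the value used in this case study, and the velocity increment is passed directly as the amount added to the velocity states, so the sign convention of the input vector $u(t)$ is not modeled.

```python
DT = 0.02  # sampling interval in hours (value used in this case study)

def propagate(state, delta_v=(0.0, 0.0)):
    """Advance [X, Y, Vx, Vy] one step; delta_v is an assumed additive velocity impulse."""
    x, y, vx, vy = state
    dvx, dvy = delta_v
    # Positions integrate the previous velocities; velocities stay constant
    # except for the impulsive increment applied at this step.
    return (x + DT * vx, y + DT * vy, vx + dvx, vy + dvy)

state0 = (20.0, 50.0, 0.0, -12.0)                 # nm, nm, knots, knots
state1 = propagate(state0, delta_v=(0.0, 8.0))    # Vy steps from -12 k to -4 k
state2 = propagate(state1)                        # no further input
```

Between impulses the velocities are piecewise constant and the positions change linearly, which is why an unmodeled impulse shows up as a lag in the position and bearing estimates.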

The SSPACK_PC software [23] was used to simulate this Gauss-Markov system for the tanker path and the results are shown in Fig. 6.6. In this scenario, we assume the instrument is capable of making measurements every $\Delta T = 0.02$ hr with a bearing precision of ±0.02 degree and range precision of ±0.005 nm (or equivalently $R_{vv} = \mathrm{diag}[4 \times 10^{-4}, 1 \times 10^{-4}]$). The model uncertainty was represented by $R_{ww} = \mathrm{diag}(1 \times 10^{-6})$. An impulse-incremental step change in velocity, for example, $V_y$ going from −12 k to −4 k, is an incremental change of +8 k, corresponding to $\Delta V_{y_o} = [8\ \ 1.5\ \ {-10.83}\ \ 3.33]$ knots and $\Delta V_{x_o} = [0\ \ 4.8\ \ {-10.3}\ \ 22.17]$ knots. These impulses (changes) occur at time fiducials of $t = [0\ \ 1\ \ 3.5\ \ 7.5\ \ 8.1\ \ 9.1]$ hr corresponding to the filed harbor path depicted in the figure. Note that the velocity changes are impulses of height $(\Delta V_x, \Delta V_y)$ corresponding to a known deterministic input, $u(t)$. These changes relate physically to instantaneous direction changes of the tanker and create the path change in the constant velocity model.

FIGURE 6.6 XBP (EKF) design for harbor docking problem (input known). (a) X-position and velocity estimates with bounds (0% out). (b) Y-position and velocity estimates with bounds (0% and 3% out). (c) Bearing and range estimates with bounds (1% and 2% out). (d) Innovations zero-mean/whiteness tests for bearing ($6 \times 10^{-4} < 26 \times 10^{-4}$ and 3% out) and range ($2 \times 10^{-4} < 42 \times 10^{-4}$ and 5% out).

The simulated bearing measurements are generated using the initial conditions $x'(0) := [20\ \text{nm}\ \ 50\ \text{nm}\ \ 0\ \text{k}\ \ {-12}\ \text{k}]$ and $R_{ww} = \mathrm{diag}[1 \times 10^{-6}, 1 \times 10^{-6}]$ with the corresponding initial covariance given by $\tilde{P}(0) = \mathrm{diag}[1 \times 10^{-6}, 1 \times 10^{-6}]$. The Jacobian matrices derived from the Gauss-Markov model above are shown below:

$$A[x] = A \quad \text{and} \quad C[x] = \begin{bmatrix} -\dfrac{x_2(t)}{R^2(t)} & \dfrac{x_1(t)}{R^2(t)} & 0 & 0 \\[2mm] \dfrac{x_1(t)}{R(t)} & \dfrac{x_2(t)}{R(t)} & 0 & 0 \end{bmatrix}$$

The XBP, IX-BP, LZ-BP, and SPBP were executed under the constraint that all of the a priori information for the tanker harbor path is "known". Each of the processors performed almost identically. A representative realization output for the XBP is shown in Fig. 6.6. In a and b we observe the estimated states $X$ (~0% out), $Y$ (~0% out), $V_x$ (~0% out), $V_y$ (~3% out).⁵ Note that the velocities are piecewise constant functions with step changes corresponding to the impulsive incremental velocities. The filtered measurements, bearing (~1% out) and range (~2% out), are shown in Fig. 6.6c with the resulting innovations zero-mean/whiteness tests depicted in d. The processor is clearly tuned, with bearing and range innovations zero-mean and white ($6 \times 10^{-4} < 26 \times 10^{-4}$/3% out) and ($2 \times 10^{-4} < 42 \times 10^{-4}$/5% out), respectively. This result is not unexpected, since all of the a priori information is given, including the precise incremental velocity input, $u(t)$. An ensemble of 101 realizations of the estimator was run by generating random initial condition estimates from the Gaussian assumption. The 101-realization ensemble averaged estimates closely follow the results shown in the figure, with the zero-mean/whiteness tests ($2 \times 10^{-4} < 25 \times 10^{-4}$/4% out), ($2 \times 10^{-4} < 15 \times 10^{-4}$/7% out) slightly worse.

Next, we investigate the realistic case where all of the information is known a priori 
except the impulsive incremental velocity changes represented by the deterministic 
input. Note that without the input, the processor cannot respond instantaneously to 
the velocity changes and therefore will lag (in time) behind in predicting the tanker 
path. The solution to this problem requires a joint estimation of the states and now
the unknown input, which is a solution to the deconvolution problem [25]. It is also a
problem that is ill-conditioned, especially since $u(t)$ is impulsive.

In any case we ran the nonlinear BP algorithms over the simulated data and the 
best results were obtained using the LZ-BP and SPBP. This is expected, since we 
used the exact state reference trajectories or filed paths, but not the input. Note that 
the other nonlinear BP have no knowledge of this trajectory inhibiting their perfor¬ 
mance in this problem. The results are shown in Fig. 6.7, where we observe the state 
estimates as before. We note that the position estimates appear reasonable, primar¬ 
ily because of the reference trajectories. The LZ-BP is able to compensate for the 
unknown impulsive input with a time lag as shown at each of the fiducials. The veloc¬ 
ity estimates (4% out, 1% out) are actually low-pass versions of the true velocities 
caused by the slower LZ-BP response even with the exact step changes available. 
These lags are more vividly shown in the bearing estimate of Fig. 6.7c which shows 
the processor has great difficulty with the instantaneous velocity changes in bearing 
(0% out). The range (0% out) appears insensitive to this lack of knowledge primar¬ 
ily because the XY-position estimates are good and do not have step changes like 
the velocity for the LZ-BP to track. Neither processor is optimal, and the innovations
sequences are zero-mean but not white ($75 \times 10^{-3} < 81 \times 10^{-3}$ / 59% out),
($2 \times 10^{-3} < 4 \times 10^{-3}$ / 8% out).


5 Here “% out” means the percentage of samples exceeding the confidence bounds.
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FIGURE 6.7 LZBP design for harbor docking problem (input unknown), (a) X-position and 
velocity estimates with bounds (68% and 4% out), (b) Y-position and velocity estimates 
with bounds (49% and 1% out), (c) Bearing and range estimates with bounds (0% and 
3% out), (d) Innovations zero-mean/whiteness tests for bearing ($75 \times 10^{-3} < 81 \times 10^{-3}$ and
59% out) and range ($2 \times 10^{-3} < 4 \times 10^{-3}$ and 8% out).


We also designed the SPBP (UKF) to investigate its performance on this problem,
and its results were quite good^6 as shown in Fig. 6.8. The processor does not perform
a model linearization but a statistical linearization instead; it is clear from the figure
that it performs better than any of the other processors for this problem. In
Fig. 6.8a-d, we see that the XY-position estimates “track” the data very well while
the XY-velocity estimates are somewhat noisier because of the abrupt changes (steps)
and the tuning values of the process noise covariance terms. In order to be able to track the
step changes, the process noise covariance could be increased even further at the
cost of noisier estimates. The SPBP tracks the estimated bearing and range reasonably
well as shown in the figure, with a slight loss of bearing track toward the end of
the time sequence. These results are demonstrated by the zero-mean/whiteness test
results of the corresponding innovations sequences. The bearing innovation statistics
are $3.3 \times 10^{-3} < 1.2 \times 10^{-1}$ and 4.7% out, and the corresponding range innovation
statistics are $6.5 \times 10^{-3} < 1.2 \times 10^{-1}$ and 4.7% out. Both indicate a tuned
processor. These are the best results of all of the nonlinear processors applied. This
completes the case study.

6 We used noisier simulation data for this run than that for the LZ-BP, with
$R_{vv} = \text{diag}[4 \times 10^{-4}, 5 \times 10^{-1}]$, providing a more realistic measurement uncertainty.

It is clear from this study that nonlinear processors can be “tuned” to give reason¬ 
able results, especially when they are provided with accurate a priori information. If 
the a priori information is provided in terms of prescribed reference trajectories as in 
this hypothetical case study, then the SPBP appears to provide superior performance, 
but in the real-world tracking problem (as discussed in Sec. 5.5) when this information 
on the target is not available, then the XBP and IX-BP can be tuned to give reasonable 
results. 

There are many variants possible for these processors to improve their performance,
whether it be in the form of improved coordinate systems [26, 27] or in the form of
a set of models, each with its own independent processor [8]. One might also consider
using estimator/smoothers as in the seismic case [7] because of the unknown
impulsive input. In any case, this is a challenging problem on which much work has been
accomplished; the interested reader should consult Ref. [8] and the references cited
within.


6.7 SUMMARY 

In this chapter we have developed the “modern” sigma-point Bayesian processor
(SPBP) or equivalently, the unscented Kalman filter (UKF), from the basic principles
of weighted linear stochastic linearization (WSLR) and $\sigma$-point transformations
(SPT). We extended the results for multivariate Gaussian distributions and calculated
the corresponding $\sigma$-points and weights. Once determined, we developed the SPBP
by extrapolating the WSLR and SPT approaches coupled to the usual Kalman fil¬ 
ter recursions. Grid-based quadrature and Gaussian sum ( G-S) processors were also 
discussed and developed using the SPBP formulation to demonstrate a distribution 
approximation approach leading to the particle filter formulation of the next chap¬ 
ter. Examples were developed throughout to demonstrate the concepts ending in a 
hypothetical investigation based on tracking a tanker entering a busy harbor. We sum¬ 
marized the results by applying some of the processors to the case study implementing 
a 2D-tracking filter. 
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MATLAB NOTES 

SSPACK_PC is a third-party toolbox in MATLAB that can be used to design model-based
signal processors [10, 23]. This package incorporates the major nonlinear
MBP algorithms discussed in this chapter, all implemented in the UD-factorized
form [38] for stable and efficient calculations. It performs the discrete approximate
Gauss-Markov simulations using (SSNSIM) and both extended (XMBP) and
iterated-extended ( IX-MBP ) processors using (SSNEST). The linearized model- 
based processor ( LZ-MBP ) is also implemented (SSLZEST) . Ensemble operations 
are seamlessly embodied within the GUI-driven framework where it is quite effi¬ 
cient to perform multiple design runs and compare results. Of course, the heart 
of the package is the command or GUI-driven post-processor (SSPOST) which 
is used to analyze and display the results of the simulations and processing (see 
http://www.techni-soft.net for more details). 

REBEL is a recursive Bayesian estimation package in MATLAB, available on
the web, that performs similar operations and includes the new statistical-based
unscented algorithms such as the UKF and the unscented transformations
[24]. It also includes the new particle filter designs discussed in [28] (see
http://choosh.ece.ogi.edu/rebel for more details).
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PROBLEMS 

6.1 Let $x_1$ and $x_2$ be i.i.d. with distribution $\mathcal{N}(0, 1)$. Suppose $y = x_1^2 + x_2^2$, then

(a) What is the distribution of $y$, $p_Y(y)$?

(b) Suppose $E\{y\} = 2$ and $\sigma_y^2 = 4$, using the sigma point transformation what
are the sigma-points for $x = [x_1 \;\; x_2]'$?

(c) What are the sigma-points for $y$?

6.2 Suppose $x \sim \mathcal{N}(0, 1)$ and $y = x^2$,

(a) What is the distribution of $y$, $p_Y(y)$?

(b) What is the Gaussian approximation of the mean and variance of $p_Y(y)$?
(Hint: Use linearization)

(c) What is the sigma-point transformation and corresponding mean and
variance estimates of $p_Y(y)$?

6.3 From the following set of nonlinear system models, develop the SPBP 
algorithm for each: 

(a) Synchronous (unsteady) motor: $\ddot{x}(t) + C\dot{x}(t) + p \sin x(t) = L(t)$

(b) Duffing equation: $\ddot{x}(t) + \alpha x(t) + \beta x^3(t) = F \cos(\omega t)$

(c) Van der Pol equation: $\ddot{x}(t) + \epsilon \dot{x}(t)[1 - x^2(t)] + x(t) = m(t)$

(d) Hill equation: $\ddot{x}(t) - \alpha x(t) + \beta p(t) x(t) = m(t)$

6.4 Suppose we are given the following discrete system 

$x(t) = -\omega^2 x(t-1) + \sin(x(t-1)) + \alpha u(t-1) + w(t-1)$
$y(t) = x(t) + v(t)$


with $w$ and $v$ zero-mean, white Gaussian with usual covariances, $R_{ww}$ and
$R_{vv}$. Develop the SPBP for this process. Suppose the parameters $\omega$ and $\alpha$ are
unknown; develop the SPBP such that the parameters are jointly estimated
along with the states. (Hint: augment the states and parameters to create a new
state vector.)

6.5 The Mackey-Glass time delay differential equation is given by 

y(t) = x(t) + v(t) 

where $\alpha$, $\beta$ are constants, $N$ is a positive integer with $w$ and $v$ zero-mean,
white Gaussian with covariances, $R_{ww}$ and $R_{vv}$. For the parameter set: $\alpha = 0.2$,
$\beta = 0.1$, $\tau = 7$ and $N = 10$ with $x(0) = 1.2$ (see [21, 29]).

Develop the SPBP for this process. 

6.6 Assume that a target is able to maneuver, that is, we assume that the target 
velocity satisfies a first-order AR model given by: 

$v_r(t) = -a v_r(t-1) + w_r(t-1)$ for $w \sim \mathcal{N}(0, R_{w_r w_r})$

(a) Develop the Cartesian tracking model for this process. 

(b) Develop the corresponding SPBP assuming all parameters are known 
a priori. 

(c) Develop the corresponding SPBP assuming a is unknown. 

6.7 Nonlinear processors can be used to develop neural networks used in many 
applications. A generic neural network behavior is governed by the following 
relations: 

$x(t) = x(t-1) + w(t-1)$
$y(t) = c[x(t), u(t), a(t)] + v(t)$

where $x(t)$ is the network weights (parameters), $u(t)$ is the input or training
sequence, $a(t)$ is the node activators with $w$ and $v$ zero-mean, white Gaussian
with covariances, $R_{ww}$ and $R_{vv}$.

Develop the SPBP for this process. 

6.8 Consider the problem of estimating a random signal from an AM modulator 
characterized by 

$s(t) = \sqrt{2P}\, a(t) \sin \omega_c t$
$r(t) = s(t) + v(t)$

where $a(t)$ is assumed to be a Gaussian random signal with power spectrum

$$S_{aa}(\omega) = \frac{2 k_a P_a}{\omega^2 + k_a^2}$$

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also assume that the processes are contaminated with the usual additive noise 
sources: w and v zero-mean, white Gaussian with covariances, R ww and 
R vv . The discrete-time Gauss-Markov model for this system was developed 
(chapter 5 problems) from the continuous-time representation using first 
differences, then: 

(a) Develop the SPBP. 

(b) Assume the carrier frequency, $\omega_c$, is unknown. Develop the SPBP for this
process.

6.9 Develop the Gaussian sum ( G-S ) processor algorithm using the XBP instead 
of the SPBP. In the literature, this is the usual approach that is used. How does 
the overall algorithm differ? What are the apparent pitfalls involved in this 
approach compared to the SPBP approach? 

6.10 We are given a sequence of data and know it is Gaussian with an unknown
mean and variance. The distribution is characterized by $y \sim \mathcal{N}(\mu, \sigma^2)$.

(a) Formulate the problem in terms of a state-space representation. (Hint:
Assume that the measurements are modeled in the usual manner: scale by the
standard deviation and add the mean to a $\mathcal{N}(0, 1)$ “known” sequence, say $v(t)$.)

(b) Using this model, develop the SPBP technique to estimate the model
parameters.

(c) Synthesize a set of data of 2000 samples at a sampling interval $dt = 0.01$
with process covariance, $R_{ww} = \text{diag}[1 \times 10^{-5}, 1 \times 10^{-6}]$ and measurement
noise of $R_{vv} = 1 \times 10^{-6}$ with $x(0) = [\sqrt{20} \;\; 3]'$.

(d) Develop the SPBP algorithm and apply it to this data, show the performance
results (final parameter estimates, etc.). That is, find the
best estimate of the parameters defined by $\Theta := [\mu \;\; \sigma]'$ using the SPBP
approach. Show the mathematical steps in developing the technique and
construct a simple SPBP to solve.




7 


PARTICLE-BASED BAYESIAN 
STATE-SPACE PROCESSORS 


7.1 INTRODUCTION 

In this chapter we develop particle-based processors using the state-space represen¬ 
tation of signals and show how they evolve from the Bayesian perspective using 
their inherent Markovian structure along with importance sampling techniques as 
our basic construct. Particle filters offer an alternative to the Kalman model-based 
processors discussed in the previous chapters possessing the capability not just to 
characterize unimodal distributions but also to characterize multimodal distributions. 
We first introduce the generic state-space particle filter ( SSPF ) and investigate some 
of its inherent distributions and implementation requirements. We develop a generic 
sampling-importance-resampling (SIR) processor and then perhaps its most popular 
form—the “bootstrap” particle filter. Next we investigate the resampling problem and 
some of the more popular resampling techniques also incorporated into the bootstrap 
filter from necessity. The bootstrap and its variants are compared to the classical and 
modern processors of the previous chapters. Finally, we apply these processors to 
a variety of problems and evaluate their performance using statistical testing as part 
of the design methodology. 


7.2 BAYESIAN STATE-SPACE PARTICLE FILTERS 

Particle filtering (PF) is a sequential Monte Carlo method employing the sequential 
estimation of relevant probability distributions using the “importance sampling” tech¬ 
niques developed in Chapter 3 and the approximations of distributions with discrete 
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random measures [1-4]. The key idea is to represent the posterior distribution by 
a set of $N_p$ random samples, the particles, with associated weights, $\{x_i(t), W_i(t)\}$;
$i = 1, \ldots, N_p$, and compute the required Monte Carlo estimates. Of course, as the
number of particles becomes very large the MC representation becomes an equiva¬ 
lent characterization of the analytical description of the posterior distribution (e.g., 
see Ex. 3.15 which converges to the optimal Bayesian estimate). 

Thus, particle filtering is a technique to implement sequential Bayesian estima¬ 
tors by MC simulation. It offers an alternative to approximate Kalman filtering for 
nonlinear problems [1,5]. In PF, continuous distributions are approximated by “discrete”
random measures composed of these weighted particles or point masses where
the particles are actually samples of the unknown or hidden states from the state-space
representation and the weights are the “probability masses” estimated using
the Bayesian recursions as shown in Fig. 7.1. From the figure we see that associated
with each particle, $x_i(t)$, is a corresponding weight or (probability) mass, $W_i(t)$ (filled
circle). Knowledge of this random measure, $\{x_i(t), W_i(t)\}$, characterizes an estimate
of the instantaneous (at time $t$) filtering posterior distribution (solid line),

$$\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} W_i(t)\,\delta(x(t) - x_i(t))$$

FIGURE 7.1 Particle filter representation of posterior probability distribution in terms of
weights (probabilities) and particles (samples).


We observe from the figure that the particles need not be equally-spaced or conform 
to a uniform grid and that they tend to coalesce in high probability regions ( HPR ). 

Importance sampling plays a crucial role in state-space particle algorithm devel¬ 
opment. The PF does not involve linearizations around current estimates, but rather 
approximations of the desired distributions by these discrete random measures in 
contrast to the Kalman filter which sequentially estimates the conditional mean and 
covariance used to characterize the (Gaussian) filtering posterior, $\Pr(x(t)|Y_t)$. Particle
filters are a sequential MC methodology based on “point mass” representation of prob¬ 
ability distributions that only require a state-space representation of the underlying 
process. This representation provides a set of particles that evolve at each time-step 
leading to an instantaneous approximation of the target posterior distribution of the 
state at time t given all of the data up to that time. Figure 7.2 illustrates the evolution 
of the posterior at each time step. Here we see the estimated posterior at times $t_1$, $t_2$
to $t_6$, creating an instantaneous approximation of $t$ vs. $x_i$ vs. $\hat{\Pr}(x(t)|Y_t)$. Statistics are
calculated across the ensemble at each time-step to provide estimates of the states. For
example, the minimum mean-squared error (MMSE) estimate is easily determined by
averaging over $x_i$, since

$$\hat{x}_{mmse}(t) = \int x(t) \Pr(x(t)|Y_t)\, dx \approx \int x(t) \hat{\Pr}(x(t)|Y_t)\, dx = \int x(t) \left( \sum_{i=1}^{N_p} W_i(t)\,\delta(x(t) - x_i(t)) \right) dx = \sum_{i=1}^{N_p} W_i(t)\, x_i(t)$$

FIGURE 7.2 Particle filter surface ($X_i$ vs. $t$ vs. $\hat{\Pr}(X_i(t)|Y_t)$) representation of posterior
probability distribution in terms of time (index), particles (samples) and weights or probabilities
(left plot illustrates extracted MAP estimates vs. $t$).

The maximum a posteriori (MAP) estimate is simply determined by finding the sample 
corresponding to the maximum weight of $x_i(t)$ across the ensemble at each time-step
(as illustrated in Fig. 7.2), that is,

$$\hat{X}_{MAP}(t) = \arg\max_{x_i(t)} \hat{\Pr}(x(t)|Y_t) \tag{7.1}$$
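Both point estimates can be read directly off a weighted particle set. The following is a minimal sketch, assuming scalar particles held in NumPy arrays; the function name and the four-particle example are illustrative only.

```python
import numpy as np

def particle_estimates(particles, weights):
    """Return (MMSE, MAP) point estimates from a weighted particle set."""
    x = np.asarray(particles, dtype=float)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()              # normalize so the weights sum to one
    x_mmse = np.sum(w * x)       # conditional mean: sum_i W_i(t) x_i(t)
    x_map = x[np.argmax(w)]      # particle carrying the largest weight
    return x_mmse, x_map

x_mmse, x_map = particle_estimates([0.0, 1.0, 2.0, 4.0], [0.1, 0.2, 0.6, 0.1])
print(x_mmse, x_map)
```

Note that the MAP estimate is always one of the particles, whereas the MMSE estimate generally falls between them.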


In Bayesian processing, the idea is to sequentially estimate the posterior,
$\Pr(x(t-1)|Y_{t-1}) \rightarrow \Pr(x(t)|Y_t)$. Recall that optimal algorithms “exactly” track these
distributions; however, they are impossible to implement because the updates require
integration that cannot usually be performed analytically. Recall that the batch
joint posterior distribution follows directly from the chain rule under the Markovian
assumptions on $x(t)$ and the conditional independence of $y(t)$, that is,

$$\Pr(X_t|Y_t) = \prod_{i=0}^{t} \Pr(x(i)|x(i-1)) \times \Pr(y(i)|x(i)) \quad \text{for } \Pr(x(0)) := \Pr(x(0)|x(-1)) \tag{7.2}$$

The corresponding sequential importance sampling solution to the Bayesian esti¬ 
mation problem was given generically in Eq. 3.59 starting with the recursive form for 
the importance distribution as 


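$$q(X_t|Y_t) = q(X_{t-1}|Y_{t-1}) \times q(x(t)|X_{t-1}, Y_t)$$

leading to the sequential expression for the importance weights as

$$W(t) \propto \frac{\Pr(X_t|Y_t)}{q(X_t|Y_t)} \propto \frac{\Pr(Y_t|X_t) \times \Pr(X_t)}{q(X_{t-1}|Y_{t-1}) \times q(x(t)|X_{t-1}, Y_t)}$$

$$W(t) = W(t-1) \times \frac{\overbrace{\Pr(y(t)|x(t))}^{\text{Likelihood}} \times \overbrace{\Pr(x(t)|x(t-1))}^{\text{Transition}}}{q(x(t)|X_{t-1}, Y_t)} \tag{7.3}$$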





Similarly the state-space particle filter ( SSPF) evolving from this sequential 
importance sampling construct follows directly. Recall that the generic state-space 
characterization representing the transition and likelihood probabilities is: 

$$\Pr(x(t)|x(t-1)) \Leftrightarrow A(x(t)|x(t-1))$$
$$\Pr(y(t)|x(t)) \Leftrightarrow C(y(t)|x(t))$$

which leads to an alternate representation of Eq. 7.2

$$\Pr(X_t|Y_t) = \prod_{i=0}^{t} A(x(i)|x(i-1)) \times C(y(i)|x(i)) \quad \text{for } A(x(0)|x(-1)) := \Pr(x(0)) \tag{7.4}$$


Thus the generic state-space sequential importance sampling solution is given by 


$$x_i(t) \sim q(x(t)|X_{t-1}, Y_t)$$
$$W_i(t) = W_i(t-1) \times \frac{C(y(t)|x_i(t)) \times A(x(t)|x_i(t-1))}{q(x_i(t)|X_{t-1}(i), Y_t)}$$
$$\mathcal{W}_i(t) = \frac{W_i(t)}{\sum_{i=1}^{N_p} W_i(t)} \tag{7.5}$$

where the sample at time $t$, $x_i(t)$, is referred to as the population or system of particles
and $X_t(i)$ for the $i$th particle as the history (trajectory or path) of that particular particle
[6]. It is important to note that a desired feature of sequential MC sampling is that the
$N_p$ particles of $X_t(i)$ are i.i.d.

In the practical application of the SSPF algorithm, we can obtain samples from 
the posterior by augmenting each of the existing samples with the new state draw, 
that is, 


$$X_t(i) = \{x_i(t), X_{t-1}(i)\}$$

where $x_i(t) \sim q(x(t)|X_{t-1}, Y_t)$ and $X_{t-1}(i) \sim q(X_{t-1}|Y_{t-1})$.
Now if we further assume that

$$q(x(t)|X_{t-1}, Y_t) \rightarrow q(x(t)|x(t-1), y(t)) \tag{7.6}$$

then the importance distribution is only dependent on $[x(t-1), y(t)]$, which is common
when performing filtering, $\Pr(x(t)|Y_t)$, at each instant of time.

Assuming this is true, then the SSPF recursion, with $x_i(t) \rightarrow X_t(i)$ and $y(t) \rightarrow Y_t$,
becomes

$$x_i(t) \sim q(x(t)|x(t-1), y(t))$$
$$W_i(t) = W_i(t-1) \times \frac{C(y(t)|x_i(t)) \times A(x(t)|x_i(t-1))}{q(x_i(t)|x_i(t-1), y(t))}$$
$$\mathcal{W}_i(t) = \frac{W_i(t)}{\sum_{i=1}^{N_p} W_i(t)} \tag{7.7}$$

and the filtering posterior is estimated by

$$\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t) \times \delta(x(t) - x_i(t)) \tag{7.8}$$

We summarize the generic SSPF in Table 7.1. Note that as $N_p$ becomes large, in
the limit, we have

$$\lim_{N_p \to \infty} \hat{\Pr}(x(t)|Y_t) \rightarrow \Pr(x(t)|Y_t) \tag{7.9}$$

which implies that the Monte Carlo error decreases as the number of particles
increases.
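The effect of the particle count on the Monte Carlo error can be illustrated with a small importance sampling experiment. The sketch below is not taken from the text: it assumes a hypothetical target $\mathcal{N}(1, 1)$ and a hypothetical proposal $\mathcal{N}(0, 2^2)$, and estimates the target mean with self-normalized weights for several values of $N_p$.

```python
import numpy as np

def is_mean_estimate(n_particles, rng):
    """Self-normalized importance sampling estimate of E{x} for x ~ N(1, 1)."""
    x = rng.normal(0.0, 2.0, size=n_particles)   # draws from the proposal q = N(0, 2^2)
    log_p = -0.5 * (x - 1.0) ** 2                # log target density, up to a constant
    log_q = -0.5 * (x / 2.0) ** 2                # log proposal density, up to a constant
    w = np.exp(log_p - log_q)                    # unnormalized importance weights
    w /= w.sum()                                 # normalized weights
    return np.sum(w * x)

rng = np.random.default_rng(0)
# Absolute error of the estimate (true mean is 1.0) for increasing particle counts
errors = {n: abs(is_mean_estimate(n, rng) - 1.0) for n in (100, 10_000, 1_000_000)}
for n, err in errors.items():
    print(n, err)
```

Any single run is random, so the error need not shrink monotonically, but on average it falls roughly as $1/\sqrt{N_p}$.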


7.3 IMPORTANCE PROPOSAL DISTRIBUTIONS 


Selection of the importance distribution is a critical part of the design phase in particle 
filtering. Besides assuring that the distribution “covers” the posterior, there are a 
number of properties that can also be satisfied to achieve a robust design. 


7.3.1 Minimum Variance Importance Distribution 

Unfortunately, the generic algorithm presented in the previous section has a serious
flaw: the variance of the importance weights increases over time [4, 6]. Therefore,
the algorithm degenerates to a single non-zero weight after a few iterations. One way
to limit this degeneracy is to choose an importance distribution that minimizes the
weight variance based on the available information, $[X_{t-1}, Y_t]$. That is, we would like
the solution to the problem of minimizing the variance, $V_{q(x(t)|X_{t-1},Y_t)}(W_i(t))$, with
respect to $q(\cdot)$ such that

$$V_q(W_i(t)) = W_i^2(t) \left[ \int \frac{\left( \Pr(y(t)|x(t)) \Pr(x(t)|X_{t-1}(i)) \right)^2}{q(x(t)|X_{t-1}(i), Y_t)}\, dx(t) - \Pr(y(t)|X_{t-1}(i))^2 \right]$$

is minimized. It has been shown [4] that the minimum variance importance
distribution that minimizes the variance of the set of weights, $\{W_i(t)\}$, is given by:

$$q_{MV}(x(t)|X_{t-1}, Y_t) \rightarrow \Pr(x(t)|x(t-1), y(t)) \tag{7.10}$$
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TABLE 7.1 Generic State-Space Particle Filtering Algorithm 


Initialize
    $x_i(0) \rightarrow \Pr(x(0))$; $W_i(0) = \frac{1}{N_p}$; $i = 1, \ldots, N_p$    [sample]

Importance Sampling
    $x_i(t) \sim A(x(t)|x_i(t-1))$    [state transition]

State-Space Transition
    $A(x(t)|x_i(t-1)) \Leftarrow A(x(t-1), u(t-1), w_i(t-1))$; $w_i \sim \Pr(w_i(t))$    [transition]

Measurement Likelihood
    $C(y(t)|x_i(t)) \Leftarrow C(x(t), u(t), v(t))$; $v_i \sim \Pr(v_i(t))$

Weight Update
    $W_i(t) = W_i(t-1) \times \dfrac{C(y(t)|x_i(t)) \times A(x(t)|x_i(t-1))}{q(x_i(t)|x_i(t-1), y(t))}$

Weight Normalization
    $\mathcal{W}_i(t) = \dfrac{W_i(t)}{\sum_{i=1}^{N_p} W_i(t)}$    [weight normalization]

Distribution
    $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\,\delta(x(t) - x_i(t))$    [posterior distribution]

State Estimation (Inference)
    $\hat{x}(t|t) = E\{x(t)|Y_t\} \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\, x_i(t)$    [conditional mean]
    $\hat{X}_{MAP}(t) = \arg\max_{x_i(t)} \hat{\Pr}(x(t)|Y_t)$    [maximum a-posteriori]
    $\hat{X}_{MED}(t) = \text{median}\{\hat{\Pr}(x(t)|Y_t)\}$    [median]
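One pass through the steps of Table 7.1 can be sketched as follows. This is a minimal illustration rather than a definitive implementation: the scalar model $x(t) = 0.9x(t-1) + w$, $y(t) = x(t) + v$, the noise variances, the Gaussian proposal with inflated variance `Rq`, and all function names are assumptions introduced for the example.

```python
import numpy as np

def gauss_pdf(z, mean, var):
    """Scalar Gaussian density evaluated elementwise."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def sis_step(x_prev, w_prev, y, sample_q, pdf_q, pdf_A, pdf_C):
    """One sequential importance sampling step applied to all particles."""
    x = sample_q(x_prev, y)                                            # x_i(t) ~ q(x(t)|x_i(t-1), y(t))
    w = w_prev * pdf_C(y, x) * pdf_A(x, x_prev) / pdf_q(x, x_prev, y)  # weight update
    w_norm = w / w.sum()                                               # weight normalization
    return x, w, w_norm

# Hypothetical scalar model and proposal (assumed values, not from the text)
rng = np.random.default_rng(1)
Rww, Rvv, Rq = 0.1, 0.2, 0.5
pdf_A = lambda x, xp: gauss_pdf(x, 0.9 * xp, Rww)         # transition density
pdf_C = lambda y, x: gauss_pdf(y, x, Rvv)                 # measurement likelihood
sample_q = lambda xp, y: 0.9 * xp + np.sqrt(Rq) * rng.standard_normal(xp.shape)
pdf_q = lambda x, xp, y: gauss_pdf(x, 0.9 * xp, Rq)       # proposal density

Np = 500
x0 = rng.standard_normal(Np)          # initial particles drawn from Pr(x(0)) = N(0, 1)
w0 = np.full(Np, 1.0 / Np)            # uniform initial weights
x1, w1, w1n = sis_step(x0, w0, 0.4, sample_q, pdf_q, pdf_A, pdf_C)
x_hat = np.sum(w1n * x1)              # conditional-mean estimate at this time step
print(x_hat)
```

Passing the proposal, transition and likelihood as separate callables keeps the step generic; choosing the proposal equal to the transition makes the ratio in the weight update collapse to the likelihood alone.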


Let us investigate this minimum variance proposal distribution in more detail 
to determine its realizability. Using Bayes’ rule we can decompose 1 this proposal 
distribution in steps with A = x(t) and B =x(t — 1), y(t): 


$$\Pr(x(t)|x(t-1), y(t)) = \frac{\Pr(x(t-1), y(t)|x(t)) \times \Pr(x(t))}{\Pr(x(t-1), y(t))} \tag{7.11}$$


1 We apply the following form of Bayes’ rule: $\Pr(A|B) = \Pr(B|A) \times \Pr(A)/\Pr(B)$.
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but the first term in the numerator can be decomposed further to obtain 


$$\Pr(x(t-1), y(t)|x(t)) = \Pr(y(t)|x(t-1), x(t)) \times \Pr(x(t-1)|x(t)) = \Pr(y(t)|x(t-1), x(t)) \times \frac{\Pr(x(t)|x(t-1)) \Pr(x(t-1))}{\Pr(x(t))} \tag{7.12}$$


Substituting this relation into Eq. 7.11 and canceling the Pr(x(t)) terms, we obtain 

$$\Pr(x(t)|x(t-1), y(t)) = \frac{\Pr(y(t)|x(t-1), x(t)) \times \Pr(x(t)|x(t-1)) \times \Pr(x(t-1))}{\Pr(x(t-1), y(t))} \tag{7.13}$$

Expand the denominator in Eq. 7.13 using Bayes’ rule,

$$\Pr(x(t-1), y(t)) = \Pr(y(t)|x(t-1)) \times \Pr(x(t-1))$$

substitute, cancel the $\Pr(x(t-1))$ terms and apply the conditional independence
assumption of the measurements on past states, that is,

$$\Pr(y(t)|x(t-1), x(t)) \rightarrow \Pr(y(t)|x(t))$$

to obtain the final expression for the minimum variance proposal distribution as

$$q_{MV}(x(t)|X_{t-1}, Y_t) = \Pr(x(t)|x(t-1), y(t)) = \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1))}{\Pr(y(t)|x(t-1))} \tag{7.14}$$


If we substitute this expression for the importance distribution in the weight 
recursion of Eq. 3.63, then we have 

$$W(t) = W(t-1) \times \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1))}{q_{MV}(x(t)|X_{t-1}, Y_t)}$$

$$W(t) = W(t-1) \Pr(y(t)|x(t)) \Pr(x(t)|x(t-1)) \times \left[ \frac{\Pr(y(t)|x(t-1))}{\Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1))} \right]$$

Canceling like terms and applying the Chapman-Kolmogorov relation, we obtain the
corresponding minimum variance weights

$$W_{MV}(t) = W_{MV}(t-1) \times \Pr(y(t)|x(t-1)) = W_{MV}(t-1) \times \int C(y(t)|x(t)) \times A(x(t)|x(t-1))\, dx(t) \tag{7.15}$$

which indicates that the importance weights can be calculated before the particles 
are propagated to time t. From this expression we can also see the problem with 
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the minimum variance importance function approach: (1) we must sample from 
$\Pr(x(t)|x(t-1), y(t))$; and (2) we must evaluate the integral, which generally has no
analytic form. 
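One case in which both requirements can be met in closed form is a linear-Gaussian model. The sketch below assumes a hypothetical scalar model $x(t) = a\,x(t-1) + w(t-1)$, $y(t) = x(t) + v(t)$ with variances $R_{ww}$ and $R_{vv}$ (values chosen arbitrarily); there the minimum variance proposal is Gaussian and $\Pr(y(t)|x(t-1)) = \mathcal{N}(a\,x(t-1), R_{ww} + R_{vv})$. The code checks numerically that proposal times incremental weight equals likelihood times transition at a test point.

```python
import numpy as np

def gauss_pdf(z, mean, var):
    """Scalar Gaussian density."""
    return np.exp(-0.5 * (z - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

a, Rww, Rvv = 0.9, 0.1, 0.2      # hypothetical model parameters

def mv_proposal_moments(x_prev, y):
    """Mean and variance of Pr(x(t)|x(t-1), y(t)) for the linear-Gaussian model."""
    var = 1.0 / (1.0 / Rww + 1.0 / Rvv)
    mean = var * (a * x_prev / Rww + y / Rvv)
    return mean, var

def mv_weight_factor(x_prev, y):
    """Pr(y(t)|x(t-1)): the incremental minimum variance weight."""
    return gauss_pdf(y, a * x_prev, Rww + Rvv)

# Check q_MV(x) * Pr(y|x(t-1)) == C(y|x) * A(x|x(t-1)) at an arbitrary point
x_prev, y, x = 0.5, 0.8, 0.3
mean, var = mv_proposal_moments(x_prev, y)
lhs = gauss_pdf(x, mean, var) * mv_weight_factor(x_prev, y)
rhs = gauss_pdf(y, x, Rvv) * gauss_pdf(x, a * x_prev, Rww)
print(lhs, rhs)
```

Note that `mv_weight_factor` depends only on the previous particle and the new measurement, which is why these weights can be computed before the particles are propagated. For nonlinear or non-Gaussian models neither function is generally available in closed form.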


7.3.2 Transition Prior Importance Distribution 

Another choice for an importance distribution is the transition prior. This prior is 
defined in terms of the state-space representation by
$A(x(t)|x(t-1)) \Leftarrow A(x(t-1), u(t-1), w(t-1))$, which is dependent on the known
excitation and process noise statistics and is given by

$$q_{prior}(x(t)|x(t-1), Y_t) \rightarrow \Pr(x(t)|x(t-1))$$

Substituting this choice into the expression for the weights of Eq. 7.3 gives

$$W_i(t) = W_i(t-1) \times \frac{\Pr(y(t)|x_i(t)) \times \Pr(x(t)|x_i(t-1))}{q_{prior}(x(t)|x_i(t-1), Y_t)} = W_i(t-1) \times C(y(t)|x_i(t)) \tag{7.16}$$


since the priors cancel. 

Note two properties for this choice of importance distribution. First, the proposal
does not use the most recent observation, $y(t)$, when the particles are drawn, and second,
the weight update does not use the past particles, $x_i(t-1)$, but only the likelihood.
This choice is easily implemented and
updated by simply evaluating the measurement likelihood, $C(y(t)|x_i(t))$; $i = 1, \ldots, N_p$
for the sampled particle set. In contrast to the minimum variance choice, these weights 
require the particles to be propagated to time t before the weights can be calculated. 

This choice of importance distribution can lead to problems, since the transition 
prior is not conditioned on the measurement data, especially the most recent. Fail¬ 
ing to incorporate the latest available information from the most recent measurement 
to propose new values for the states leads to only a few particles having significant 
weights when their likelihood is calculated. The transition prior is a much broader 
distribution than the likelihood indicating that only a few particles will be assigned 
a large weight. Thus, the algorithm will degenerate rapidly and lead to poor per¬ 
formance especially when data outliers occur or measurement noise is small. These 
conditions lead to a “mismatch” between the prior prediction and posterior distribu¬ 
tions. Techniques such as the auxiliary particle filter [2,7,8] as well as local linearized 
particle filters [4, 6, 9] have been developed that drive the particles to regions of high 
likelihood by incorporating the current measurement. Thus, the SSPF algorithm takes 
the same generic form as before with the minimum variance approach; however, we 
note that the importance weights are much simpler to evaluate with this approach 
which has been termed the bootstrap PF, the condensation PF, or the survival of the 
fittest algorithm. We summarize the bootstrap particle filter algorithm in Table 7.2 to 
follow. 
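The sensitivity of the transition prior proposal to small measurement noise can be seen in a short experiment. The sketch below assumes a hypothetical scalar model $x(t) = a\,x(t-1) + w(t-1)$, $y(t) = x(t) + v(t)$ with arbitrary parameter values; it propagates the same particles through the prior and compares how much normalized weight the ten heaviest particles carry for a broad and a narrow likelihood.

```python
import numpy as np

def bootstrap_weights(x_prev, y, a, Rww, Rvv, rng):
    """Propagate particles through the transition prior, then weight by the likelihood."""
    x = a * x_prev + np.sqrt(Rww) * rng.standard_normal(x_prev.shape)  # x_i(t) ~ A(x(t)|x_i(t-1))
    log_w = -0.5 * (y - x) ** 2 / Rvv        # log C(y(t)|x_i(t)), up to a constant
    w = np.exp(log_w - log_w.max())          # subtract the maximum for numerical stability
    return x, w / w.sum()                    # particles and normalized weights

rng = np.random.default_rng(2)
x_prev = rng.standard_normal(1000)           # particles at time t-1 (assumed N(0, 1))
y = 2.5                                      # an observation in the tail of the prediction

_, w_broad = bootstrap_weights(x_prev, y, 0.9, 1.0, 1.0, rng)    # large measurement noise
_, w_narrow = bootstrap_weights(x_prev, y, 0.9, 1.0, 1e-4, rng)  # small measurement noise

frac_broad = np.sort(w_broad)[-10:].sum()    # mass held by the 10 heaviest particles
frac_narrow = np.sort(w_narrow)[-10:].sum()
print(frac_broad, frac_narrow)
```

With the narrow likelihood nearly all of the probability mass sits on a handful of particles, which is the mismatch between prior prediction and posterior described above.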
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7.4 RESAMPLING 

The main objective in simulation-based sampling techniques is to generate i.i.d. sam¬ 
ples from the targeted posterior distribution in order to perform statistical inferences 
extracting the desired information. Thus, the importance weights are quite critical 
since they contain probabilistic information about each specific particle. In fact, 
they provide us with information about “how probable a sample drawn from the target
posterior has been” [10,11]. Therefore, the weights can be considered acceptance
probabilities enabling us to generate (approximately) independent samples from the
posterior, $\Pr(x(t)|Y_t)$. Recall that the empirical distribution, $\hat{\Pr}(x(t)|Y_t)$, is defined over
a set of finite ($N_p$) random measures, $\{x_i(t), W_i(t)\}$; $i = 1, \ldots, N_p$, approximating the
posterior, that is,

$$\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} W_i(t)\,\delta(x(t) - x_i(t)) \tag{7.17}$$

One of the major problems with importance sampling algorithms is the depletion 
of the particles. The degeneracy of the particle weights creates a problem that must be 
resolved before these algorithms can be of any pragmatic use. It occurs because the 
variance of the importance weights increases in time [4] thereby making it impossible 
to avoid this weight degradation. Degeneracy implies that a large computational effort 
is devoted to updating particles whose contribution to the posterior is negligible. This 
approach is bound to fail in the long run, since the weight degeneration leads to a few 
particles containing most of the probability mass. Thus, there is a need to somehow 
resolve this problem to make the sequential simulation-based techniques viable. This 
requirement leads to the idea of “resampling” the particles. 

Resampling involves sampling $N_p$ draws from the current population of particles
using the normalized weights as selection probabilities. The resampling process is
illustrated in Fig. 7.3. Particles of low probability (small weights) are removed and
those of high probability (large weights) are retained and replicated. Resampling
results in two major effects: (1) the algorithm is more complex and is not merely the
simple importance sampling method; and (2) the resampled trajectories, $X_t(i)$, are no
longer i.i.d. and the normalized weights are set to $1/N_p$.

Resampling, therefore, can be thought of as a realization of enhanced particles,
$\hat{x}_k(t)$, extracted from the original samples, $x_i(t)$, based on their “acceptance
probability”, $W_i(t)$, at time $t$. Statistically, we have

$$\Pr(\hat{x}_k(t) = x_i(t)) = W_i(t) \quad \text{for } i = 1, \ldots, N_p \tag{7.18}$$

or we write it symbolically as

$$\hat{x}_k(t) \Rightarrow x_i(t)$$

where “$\Rightarrow$” defines the resampling operator generating a set of new particles, $\{\hat{x}_k(t)\}$,
replacing the old set, $\{x_i(t)\}$.

FIGURE 7.3 Resampling consists of processing the predicted particles with their associ¬ 
ated weights (probabilities), duplicating those particles of high weight (probability) and 
discarding those of low weight. 
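The simplest realization of this idea draws each new particle independently from the old set with probability equal to its normalized weight (often called multinomial resampling). The following is a minimal sketch under that assumption; the function name and the four-particle example are illustrative only.

```python
import numpy as np

def multinomial_resample(particles, weights, rng):
    """Draw N_p new particles with Pr(new = x_i) = W_i, then reset the weights to 1/N_p."""
    x = np.asarray(particles)
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()                                            # normalized selection probabilities
    idx = rng.choice(len(x), size=len(x), replace=True, p=w)   # sample indices with replacement
    return x[idx], np.full(len(x), 1.0 / len(x))

rng = np.random.default_rng(3)
x_new, w_new = multinomial_resample([-1.0, 0.0, 1.0, 2.0], [0.0, 0.1, 0.8, 0.1], rng)
print(x_new, w_new)
```

The particle with zero weight can never be selected, while the heavily weighted particle is typically duplicated several times.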

The fundamental concept in resampling theory is to preserve particles with large 
weights (i.e., large probabilities) while discarding those with small weights. Two 
steps must occur to resample effectively: (1) a decision, on a weight-by-weight basis, 
must be made to select the appropriate weights and reject the inappropriate; and 
(2) resampling must be performed to minimize the degeneracy. This overall strategy 
when coupled with importance sampling is termed sequential
sampling-importance-resampling (SIR) [4].

A reasonable measure of degeneracy is the effective particle sample size based on 
the coefficient of variation [12] defined by 


$$N_{eff} = \frac{N_p}{E_q\{W^2(X_t)\}} = \frac{N_p}{1 + V_q(W(X_t))} \le N_p \tag{7.19}$$

An estimate of the effective number of particles at time $t$ is given by

$$\hat{N}_{eff}(t) = \frac{1}{\sum_{i=1}^{N_p} W_i^2(t)} \tag{7.20}$$

and a decision based on the rejection method [13] is made by comparing it to a
threshold, $N_{thresh}$. That is, when $\hat{N}_{eff}(t)$ is less than the threshold, resampling is
performed.

$$\hat{N}_{eff}(t) = \begin{cases} \le N_{thresh} & \text{Resample} \\ > N_{thresh} & \text{Accept} \end{cases}$$
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FIGURE 7.4 Resampling (with replacement) by inverse transforming a uniform sampler 
to generate samples from target distribution. 
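
The resampling decision of Eq. 7.20 can be sketched in a few lines. This is only an illustration: the function names `effective_sample_size` and `should_resample` are hypothetical, and the input weights are renormalized defensively, which the text does not require.

```python
def effective_sample_size(weights):
    """Estimate N_eff = 1 / sum_i W_i^2 (Eq. 7.20) from importance weights."""
    total = sum(weights)
    normalized = [w / total for w in weights]  # assumption: renormalize defensively
    return 1.0 / sum(w * w for w in normalized)


def should_resample(weights, n_thresh):
    """Resample when the estimated effective sample size is at or below N_thresh."""
    return effective_sample_size(weights) <= n_thresh


uniform = [0.25, 0.25, 0.25, 0.25]      # no degeneracy: N_eff equals N_p
degenerate = [0.97, 0.01, 0.01, 0.01]   # one dominant particle: N_eff close to 1
n_eff_uniform = effective_sample_size(uniform)
n_eff_degenerate = effective_sample_size(degenerate)
```

With uniform weights the estimate equals the number of particles, while a single dominant weight drives it toward one, which is when the threshold test triggers resampling.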


Once the decision is made to resample, a uniform sampling procedure [14] can 
be applied removing samples with low importance weights and replicating samples 
with high importance weights. Resampling (with replacement) generates a set of new 
samples with uniform weights from an approximate discrete posterior distribution, 

$\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} W_i(t)\,\delta(x(t) - x_i(t))$ (7.21) 

so that $\Pr[\hat{x}_i(t) = x_j(t)] = W_j(t)$. The resulting independent and identically distributed 
sample from the probability mass of Eq. 7.21 is uniform such that the sampling 
induces the mapping $\{\hat{x}_i(t), \hat{W}_i(t)\} \Rightarrow \{x_i(t), W_i(t)\}$, $\hat{W}_i(t) = 1/N_p\ \forall i$. The selection 
of $\hat{x}_i(t) = x_j(t)$ is shown in Fig. 7.4. Here the procedure is termed systematic 
resampling [4]. Each original particle $x_i(t)$ is related to its resampled copies, appearing $N_i$ times in the new set. 
The methodology relies on uniformly sampling the cumulative distribution 
function, resulting in the new replicated particles with uniform weighting. The 
resampling algorithm is incorporated in Table 7.2. 

Resampling decreases the degeneracy problem algorithmically, but introduces its 
own set of problems. After one resampling step, the simulated trajectories are no 
longer statistically independent. Therefore, the simple theoretical convergence results 
under these assumptions lose their validity. Pragmatically, resampling can limit algorithm 
parallelization because combining particles causes an increase in computational 
complexity. Also there is a possible loss in diversity caused by replication of those 
particles with highest importance weights [4]. Thus as with any methodology there 
are tradeoffs that must be considered. 

TABLE 7.2 Bootstrap SIR State-Space Particle Filtering Algorithm 

Initialize 
    $x_i(0) \sim \Pr(x(0))$, $i = 1, \ldots, N_p$    [sample] 

Importance Sampling 
    $x_i(t) \sim A(x(t)|x_i(t-1)) \Leftarrow A(x(t-1), u(t-1), w_i(t-1))$; $w_i \sim \Pr(w_i(t))$    [state transition] 

Weight Update 
    $W_i(t) = C(y(t)|x_i(t)) \Leftarrow C(x(t), u(t), v(t))$; $v \sim \Pr(v(t))$    [weight/likelihood] 

Weight Normalization 
    $W_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$ 

Resampling Decision 
    $\hat{N}_{eff} = 1 / \sum_{i=1}^{N_p} W_i^2(t)$    [effective samples] 
    Resample if $\hat{N}_{eff} \le N_{thresh}$; Accept if $\hat{N}_{eff} > N_{thresh}$ 

Resampling 
    $\hat{x}_i(t) \Rightarrow x_i(t)$ 

Distribution 
    $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} W_i(t)\,\delta(x(t) - x_i(t))$    [posterior distribution] 

7.4.1 Multinomial Resampling 

There are a variety of techniques available to implement the basic resampling method 
[1, 6, 16, 19]. The usual approach is to resample with replacement, since the probability 
of each particle $x_i(t)$ is given by the normalized weight $W_i(t)$. Therefore, the number 
of times $N_i$ that each particular particle in the original set $\{x_i(t)\}$ is selected follows 
a binomial distribution, $\text{Bin}(N_p, W_i(t))$. The corresponding vector, $[N_1, \ldots, N_{N_p}]$, is 
distributed according to a multinomial distribution with parameter $N_p$ and probability 
of success $[W_1(t), \ldots, W_{N_p}(t)]$. With this resampling scheme, particles in the original 
set with small weights are most likely discarded, while those of high weights 
are replicated in proportion to these weights. The multinomial resampling method is 
given by: 

• Given a random measure at time t, that is, a set of particles and weights, 
$\{x_i(t), W_i(t)\}$, $i = 1, \ldots, N_p$; 

• Sample uniformly, $u_k \to U(0, 1)$; $k = 1, \ldots, N_p$; 

• Determine the index, $i_k$: $i_k = k$ for $\Pr(x_{i_k}(t) = x_k(t)) = u_k$; 

• Select a new sample, $\hat{x}_{i_k}(t) \Rightarrow x_i(t)$ and weight, $\hat{W}_{i_k}(t) = 1/N_p$ based on the new 
sample index, $i_k$; and 

• Generate the new random (resampled) measure: $\{\hat{x}_{i_k}(t), \hat{W}_{i_k}(t)\}$; for $k = 1, \ldots, N_p$. 

Here the index notation, $i_k$, designates the original $i^{th}$-particle or parent and the new $k^{th}$-particle 
using the inverse CDF method of Sec. 3.3. This sampling scheme is equivalent 
to drawing, $i_k$; $k = 1, \ldots, N_p$ samples from a multinomial distribution with parameters, 
$\mathcal{M}(N_{i_k}, W_{i_k}(t))$ and corresponding statistics: mean, $E\{N_{i_k}\} = N_p W_{i_k}(t)$ and variance, 
$\text{Var}(N_{i_k}) = N_p W_{i_k}(t)(1 - W_{i_k}(t))$. 

The basic idea is to first construct the CDF from the original random measure, 
$\{x_i(t), W_i(t)\}$, since it is given by 

$\Pr(X(t) \le x_i(t)) \approx \sum_{i=1}^{N_p} W_i(t)\,\mu(x(t) - x_i(t))$ (7.22) 

where $\mu(\cdot)$ is the unit step function. 

Uniform samples, $u_k$, are drawn from the interval [0,1] and projected onto the 
inverse CDF (see Sec. 3.3) corresponding to the associated probability and identifying 
the particular new particle sample index, $i_k$, and corresponding replacement particle, 
$\hat{x}_{i_k}(t)$, leading to the resampling 

$\hat{x}_{i_k}(t) \Rightarrow x_i(t)$ (7.23) 

Clearly those particles or samples with highest probability (or weights) will be selected 
more frequently, thereby replacing particles with lower probability (weights), and 
therefore the new random measure is created, that is, 

$\{\hat{x}_{i_k}(t), \hat{W}_{i_k}(t)\} \Rightarrow \{x_i(t), W_i(t)\}$ with $\hat{W}_{i_k}(t) = \frac{1}{N_p}$ (7.24) 

This resampling technique represents a direct implementation of random sampling 
by generating an i.i.d. sample from the empirical posterior distribution 

$\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} W_i(t)\,\delta(x(t) - x_i(t)) \to \sum_{k=1}^{N_p} \hat{W}_{i_k}(t)\,\delta(x(t) - \hat{x}_{i_k}(t)) = \frac{1}{N_p} \sum_{k=1}^{N_p} \delta(x(t) - \hat{x}_{i_k}(t))$ 

(7.25) 
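
The multinomial steps above can be sketched with the inverse-CDF lookup of Eq. 7.22. This is a minimal illustration for scalar particles, not the book's implementation; the function name `multinomial_resample` and the returned parent-index list are assumptions made for the example.

```python
import bisect
import itertools
import random


def multinomial_resample(particles, weights, rng):
    """Draw N_p particles with replacement so that P(pick i) = W_i (inverse CDF)."""
    n_p = len(particles)
    cdf = list(itertools.accumulate(weights))    # staircase CDF of Eq. 7.22
    total = cdf[-1]
    indices = []
    for _ in range(n_p):
        u_k = rng.random() * total               # u_k ~ U(0, 1), scaled by the weight sum
        i_k = bisect.bisect_right(cdf, u_k)      # first index whose CDF value exceeds u_k
        indices.append(min(i_k, n_p - 1))        # guard against floating-point round-off
    resampled = [particles[i] for i in indices]  # replicate the selected parents
    new_weights = [1.0 / n_p] * n_p              # uniform weights after resampling
    return resampled, new_weights, indices


rng = random.Random(0)
particles = [-1.0, 0.0, 1.0, 2.0]
weights = [0.0, 0.1, 0.9, 0.0]
resampled, new_weights, indices = multinomial_resample(particles, weights, rng)
```

Particles with zero weight are never selected, and the parent indices make it easy to count how often each original particle was replicated.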

A second more efficient way of generating this measure is the “systematic” 
resampling method. 

7.4.2 Systematic Resampling 

The systematic resampling method is based on an ordered technique in which a set 
of $N_p$-ordered uniform variates are generated [20]. It minimizes the error variance 
between the original selected sample and its mean. Thus, the systematic sampling 
method is given by: 

• Given a random measure at time t, that is, a set of particles and weights, 
$\{x_i(t), W_i(t)\}$, $i = 1, \ldots, N_p$; 

• Sample uniform $N_p$-ordered variates: $\hat{u}_k = u_k + \frac{k-1}{N_p}$ for $k = 1, \ldots, N_p$ and 
$u_k \to U(0, 1/N_p)$, so that every $\hat{u}_k$ lies in [0, 1); 

• Determine the index, $i_k$: $i_k = k$ for $P_X(x_{k-1}(t)) \le \hat{u}_k \le P_X(x_k(t))$; (see Fig. 7.4) 

• Select a new sample, $\hat{x}_{i_k}(t) \Rightarrow x_i(t)$ and weight, $\hat{W}_{i_k}(t) = 1/N_p$ based on the new 
sample index, $i_k$; and 

• Generate the new random (resampled) measure: $\{\hat{x}_{i_k}(t), \hat{W}_{i_k}(t)\}$; for $k = 1, \ldots, N_p$. 

where recall the CDF is given by: $P_X(x_k(t)) = \sum_{k=1}^{N_p} W_k(t)\,\mu(x(t) - x_k(t))$ with $\mu(\cdot)$ 
a unit step function. 
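
A sketch of the systematic scheme follows. It assumes the common convention of a single uniform offset drawn once in [0, 1/N_p) and shared by all ordered variates; the function name `systematic_resample` is hypothetical.

```python
import itertools
import random


def systematic_resample(particles, weights, rng):
    """Resample using N_p ordered variates spaced 1/N_p apart on the CDF."""
    n_p = len(particles)
    cdf = list(itertools.accumulate(weights))
    total = cdf[-1]
    u0 = rng.random() / n_p                     # assumption: one shared offset in [0, 1/N_p)
    resampled, indices = [], []
    i = 0
    for k in range(n_p):
        u_k = (u0 + k / n_p) * total            # k-th ordered variate
        while i < n_p - 1 and cdf[i] <= u_k:    # advance to the CDF step containing u_k
            i += 1
        indices.append(i)
        resampled.append(particles[i])
    return resampled, [1.0 / n_p] * n_p, indices


rng = random.Random(1)
particles = [10.0, 20.0, 30.0, 40.0]
weights = [0.5, 0.25, 0.25, 0.0]
resampled, new_weights, indices = systematic_resample(particles, weights, rng)
```

Because the variates are ordered, a single pass over the CDF suffices, and the replication counts differ from their expected values N_p W_i by less than one.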

The final sampling scheme we discuss has a low weight variance, the residual 
method. 


7.4.3 Residual Resampling 

The residual resampling method is based on the idea of estimating the number of 
times each particle should be replicated, that is, the $i^{th}$-particle is replicated, say 

$N_p(i) := \text{Int}(E\{N_p(i)\}) = \text{Int}(N_p \times W_i(t))$ (7.26) 

times where Int means the integer part (the largest integer not exceeding its argument). 
The remaining particles are sampled using the multinomial sampling method 
discussed above. Here we have 

$\bar{N}_p(t) := N_p - \sum_{i=1}^{N_p} N_p(i)$ (7.27) 

with corresponding weights 

$\hat{W}_i(t) = \frac{1}{\bar{N}_p(t)} \left( N_p W_i(t) - N_p(i) \right)$ (7.28) 

The overall effect is to reduce the variance by $E\{(N_p(i) - E\{N_p(i)\})^2\}$, since the 
particles cannot be replicated fewer than $\text{Int}(E\{N_p(i)\})$ times. 

We summarize the residual resampling method by: 

• Given a random measure at time t, that is, a set of particles and weights, 
$\{x_i(t), W_i(t)\}$, $i = 1, \ldots, N_p$; 

• Calculate $N_p(i)$: $N_p(i) = \text{Int}(N_p \times W_i(t))$; 

• Multinomial Sample: $\hat{x}_i(t) \Rightarrow x_i(t)$ for the remaining $\bar{N}_p(t)$ particles using the weights of Eq. 7.28; and 

• Update the new random (resampled) measure: $\{\hat{x}_{i_k}(t), \hat{W}_{i_k}(t)\}$; for $k = 1, \ldots, N_p$. 
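
The residual method can be sketched as a deterministic replication pass followed by a multinomial draw on the leftover weights of Eq. 7.28. The function name `residual_resample` and the returned count vector are assumptions made for illustration.

```python
import bisect
import itertools
import math
import random


def residual_resample(particles, weights, rng):
    """Replicate each particle Int(N_p * W_i) times, then fill the rest multinomially."""
    n_p = len(particles)
    total = sum(weights)
    expected = [n_p * w / total for w in weights]          # E{N_p(i)} = N_p * W_i
    counts = [math.floor(e) for e in expected]             # deterministic copies, Eq. 7.26
    n_rest = n_p - sum(counts)                             # remaining draws, Eq. 7.27
    if n_rest > 0:
        residual = [e - c for e, c in zip(expected, counts)]   # unnormalized Eq. 7.28
        cdf = list(itertools.accumulate(residual))
        for _ in range(n_rest):
            u = rng.random() * cdf[-1]
            i = min(bisect.bisect_right(cdf, u), n_p - 1)
            counts[i] += 1                                 # one extra copy of particle i
    resampled = [p for p, c in zip(particles, counts) for _ in range(c)]
    return resampled, [1.0 / n_p] * n_p, counts


rng = random.Random(2)
particles = [10.0, 20.0, 30.0, 40.0]
weights = [0.5, 0.375, 0.125, 0.0]
resampled, new_weights, counts = residual_resample(particles, weights, rng)
```

Only the leftover draws are random, which is why the replication counts vary less than under purely multinomial resampling.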

So we see that there are a variety of resampling schemes that can be employed 
to solve the particle degeneracy problem. We can now update our generic particle 
filtering algorithm to incorporate a resampling procedure and alleviate the degeneracy 
problem created by the variation of the weights. 

To visualize the “resampling” approach mitigating the particle degeneracy problem, 
the SIR is illustrated in Fig. 7.5. Here we show the evolution of the particles and 
weights starting with the uniform weighting ($W_i(t-1) = 1/N_p$). Once the initial weights 
are updated, they are resampled uniformly. Next they are propagated using the state-space 
transition mechanism (model), then updated using the measurement likelihood 
producing the measure, $\{x_i(t), W_i(t)\}$, $i = 1, \ldots, N_p$, leading to an approximation of the 
posterior distribution at time t. This measure is then resampled, propagated, updated 
and so on. A generic flow diagram of the algorithm is shown in Fig. 7.6 where we 
again illustrate the basic ingredients of the SIR technique. 


7.5 STATE-SPACE PARTICLE FILTERING TECHNIQUES 

There are a number of pragmatic PF techniques that have been introduced in 
the literature. Here we discuss some of the more robust and popular techniques 
that have been applied to a wide variety of problems starting with the bootstrap 
processor [1, 2, 6]. 

FIGURE 7.5 Evolution of particle filter weights and particles using the sequential 
state-space SIR algorithm: resampling, propagation-step (state-space transition model), 
update-step (state-space measurement likelihood), resampling, ... 

7.5.1 Bootstrap Particle Filter 

The basic “bootstrap” algorithm developed by Gordon, Salmond and Smith [16] is 
one of the first practical implementations of the processor to the tracking problem. 
It is the most heavily applied of all PF techniques due to its simplicity. Thus, the 
SSPF algorithm takes the same generic form as before with the minimum variance 
approach; however, we note that the importance weights are much simpler to evaluate 
with this approach which has been termed the bootstrap PF, the condensation PF, or 
the survival of the fittest algorithm [2, 16, 22]. 

As mentioned previously, it is based on sequential sampling-importance-resampling 
(SIR) ideas and uses the transition prior as its underlying proposal 
distribution, 

$q_{prior}(x(t)|x(t-1), Y_t) = \Pr(x(t)|x(t-1))$ 

The corresponding weight becomes quite simple and only depends on the 
likelihood; therefore, it is not even necessary to perform a sequential updating because 

$W(t) = \left( \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1))}{\Pr(x(t)|x(t-1))} \right) \times W(t-1) = \Pr(y(t)|x(t)) \times W(t-1)$ 

since the filter requires resampling to mitigate variance (weight) increases at each 
time step [16]. After resampling, the new weights become 

$W(t) \to \frac{1}{N_p} \Rightarrow W(t) = \Pr(y(t)|x(t)) = C(y(t)|x(t))$ 

revealing that there is no need to save the likelihood (weight) from the previous step! 

FIGURE 7.6 State-space SIR particle filtering algorithm structure: initialization, propagation 
(state transition), updating (measurement likelihood), resampling. 

With this in mind we summarize the simple bootstrap particle filter algorithm in 
Table 7.2. One of the primary features as well as shortcomings of this technique 
is that it does not use the latest available measurement in the importance proposal, 
but only the previous particles, $x_i(t-1)$, which differs from the minimum variance 
approach. Also in order to achieve convergence, it is necessary to resample at every 
time step. In practice, however, many applications make the decision to resample 
based on the effective sample-size metric discussed in the previous section. 

In order to construct the bootstrap PF, we assume that: (1) $x_i(0) \sim \Pr(x(0))$ is 
known; (2) $A(x(t)|x(t-1))$, $C(y(t)|x(t))$ are known; (3) samples can be generated 
using the process noise input and the state-space model, $A(x(t-1), u(t-1), w(t-1))$; 
(4) the likelihood is available for point-wise evaluation, $C(y(t)|x(t))$, based on the 
measurement model, $C(x(t), v(t))$; and (5) resampling is performed at every time-step. 
To implement the algorithm we 

• Generate the initial state, $x_i(0)$ 

• Generate the process noise, $w_i(t)$ 

• Generate the particles, $x_i(t) = A(x_i(t-1), u(t-1), w_i(t-1))$ (the prediction-step) 

• Generate the likelihood, $C(y(t)|x_i(t))$, using the current particle and 
measurement (the update step) 

• Resample the set of particles retaining and replicating those of highest weight 
(probability), $\hat{x}_i(t) \Rightarrow x_i(t)$ 

• Generate the new set, $\{\hat{x}_i(t), \hat{W}_i(t)\}$ with $\hat{W}_i(t) = 1/N_p$ 

Next we revisit the model of Jazwinski [21] in Chapter 5 and apply the simple 
bootstrap algorithm to demonstrate the PF solution using the state-space SIR particle 
filtering algorithm. 

Example 7.1 

Recall the discrete state-space representation of the basic problem given by the 
Markovian model: 

$x(t) = (1 - 0.05\Delta T)\,x(t-1) + 0.04\Delta T\,x^2(t-1) + w(t-1)$ 
$y(t) = x^2(t) + x^3(t) + v(t)$ 

where $\Delta T = 0.01$, $w \sim \mathcal{N}(0, 10^{-6})$ and $v \sim \mathcal{N}(0, 0.09)$. The initial state is Gaussian 
distributed with $x_i(0) \sim \mathcal{N}(\bar{x}(0), P(0))$ and $\bar{x}(0) = 2.0$, $P(0) = 10^{-2}$. 

We selected the following simulation run parameters: 

Number of Particles: 250 
Number of Samples: 150 
Number of States: 1 
Sampling Interval: 0.01 sec 
Number of Measurements: 1 
Process Noise Covariance: $1 \times 10^{-6}$ 
Measurement Noise Covariance: $9 \times 10^{-2}$ 
Initial State: 2 
Initial State Covariance: $10^{-20}$ 


Thus, the bootstrap SIR algorithm of Table 7.2 for this problem becomes: 

1. Draw samples (particles) from the state transition distribution: $x_i(t) \to \mathcal{N}(x(t) : a[x(t-1)], R_{ww})$, that is, generate 

   $w_i(t) \to \Pr(w(t)) \sim \mathcal{N}(0, R_{ww})$ 

   and calculate $\{x_i(t)\}$ using the process model and $w_i(t)$ 

   $x_i(t) = (1 - 0.05\Delta T)\,x_i(t-1) + 0.04\Delta T\,x_i^2(t-1) + w_i(t-1)$ 

2. Estimate the weight/likelihood, $W_i(t) = C(y(t)|x_i(t)) \to \mathcal{N}(y(t) : c[x_i(t)], R_{vv}(t))$ 

   $c[x_i(t)] = x_i^2(t) + x_i^3(t)$ 

   $\ln C(y(t)|x_i(t)) = -\frac{1}{2}\ln 2\pi R_{vv} - \frac{(y(t) - x_i^2(t) - x_i^3(t))^2}{2R_{vv}}$ 

3. Update the weight: $W_i(t) = C(y(t)|x_i(t))$ 

4. Normalize the weight: $W_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$ 

5. Decide to resample if $\hat{N}_{eff} \le N_{thresh}$ 

6. If resample: $\hat{x}_i(t) \Rightarrow x_i(t)$ 

7. Estimate the instantaneous posterior: 

   $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} W_i(t)\,\delta(x(t) - x_i(t))$ 

8. Estimate (inference) the corresponding statistics: 

   $\hat{X}_{MAP}(t) = \arg\max \hat{\Pr}(x(t)|Y_t)$ 

   $\hat{X}_{MMSE}(t) = E\{x(t)|Y_t\} = \sum_{i=1}^{N_p} W_i(t)\,x_i(t)$ 

   $\hat{X}_{MEDIAN}(t) = \text{median}(\hat{\Pr}(x(t)|Y_t))$ 
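
The eight steps above can be sketched for the scalar model of this example. This is a minimal illustration rather than the code used to generate the book's figures: the function names, the random seeds, the use of $P(0) = 10^{-2}$ for the initial spread, multinomial resampling via `random.Random.choices`, and resampling at every step by default are all assumptions. Only the MMSE estimate is computed.

```python
import math
import random

DT, R_WW, R_VV = 0.01, 1e-6, 0.09     # sampling interval and noise covariances


def a(x):
    """Process model a[x(t-1)] of Example 7.1."""
    return (1.0 - 0.05 * DT) * x + 0.04 * DT * x * x


def c(x):
    """Measurement model c[x(t)] of Example 7.1."""
    return x * x + x ** 3


def bootstrap_pf(measurements, n_p=250, x0=2.0, p0=1e-2, n_thresh=None, seed=0):
    rng = random.Random(seed)
    n_thresh = n_p if n_thresh is None else n_thresh   # default: resample (almost) always
    particles = [rng.gauss(x0, math.sqrt(p0)) for _ in range(n_p)]
    weights = [1.0 / n_p] * n_p
    estimates = []
    for y in measurements:
        # Step 1: propagate each particle through the process model plus noise.
        particles = [a(x) + rng.gauss(0.0, math.sqrt(R_WW)) for x in particles]
        # Steps 2-3: Gaussian log-likelihood, shifted by its maximum for stability.
        log_l = [-0.5 * math.log(2 * math.pi * R_VV) - (y - c(x)) ** 2 / (2 * R_VV)
                 for x in particles]
        shift = max(log_l)
        weights = [w * math.exp(l - shift) for w, l in zip(weights, log_l)]
        # Step 4: normalize.
        total = sum(weights)
        weights = [w / total for w in weights]
        # Step 8: MMSE estimate as the weighted particle mean.
        estimates.append(sum(w * x for w, x in zip(weights, particles)))
        # Steps 5-6: resample when the effective sample size is at or below threshold.
        n_eff = 1.0 / sum(w * w for w in weights)
        if n_eff <= n_thresh:
            particles = rng.choices(particles, weights=weights, k=n_p)
            weights = [1.0 / n_p] * n_p
    return estimates


# Simulate 150 samples of the model, then run the filter on the noisy measurements.
sim = random.Random(42)
x, truth, ys = 2.0, [], []
for _ in range(150):
    x = a(x) + sim.gauss(0.0, math.sqrt(R_WW))
    truth.append(x)
    ys.append(c(x) + sim.gauss(0.0, math.sqrt(R_VV)))
x_mmse = bootstrap_pf(ys)
rmse = math.sqrt(sum((e - t) ** 2 for e, t in zip(x_mmse, truth)) / len(truth))
```

Passing a smaller `n_thresh` switches the sketch from resampling at every step to the effective-sample-size decision of step 5; in that case the weights are carried forward and multiplied by the new likelihood, as in the sequential update.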

Note that compared to the previous examples of Chapter 5 for the extended 
Bayesian processors, we have added more process noise to demonstrate the effectiveness 
of the bootstrap processor. The results of the bootstrap PF are shown in Fig. 7.7. 
The usual classical performance metrics are shown: zero-mean (0.03 < 0.17), whiteness 
(0.78% out) and WSSR (below threshold) all indicating (approximately) a tuned 
processor. The PF tracks the state and measurement after the initial transient error has 
diminished. 

In Fig. 7.8 we show the bootstrap state and measurement estimates (inferences), 
that is, the MAP and MMSE compared to the modern sigma-point processor SPBP 
(UKF). The plots demonstrate that the PF can outperform the sigma-point design, 
which assumes a unimodal Gaussian distribution. The estimated state and predicted 
measurement posterior distributions are shown in Fig. 7.9 along with a time-slice 
in Fig. 7.10 at time 1.04 sec demonstrating the capability of the bootstrap PF to 
characterize the multimodal nature of this problem. 

This completes the development of the most popular and simple PF technique. 
We mention in passing that a simple pragmatic method of preventing the sample 
impoverishment problem is to employ a method suggested by Gordon [16] and 
refined in [17, 18] termed particle “roughening” which is similar to adding process 
noise to constant parameters when constructing a random walk GM model. 
Roughening (see note 2 below) consists of adding random noise to each particle after resampling is 
accomplished, that is, the a posteriori particles are modified as 

$\tilde{x}_i(t) = \hat{x}_i(t) + \epsilon_i(t)$ (7.29) 

where $\epsilon_i \sim \mathcal{N}(0, \text{diag}[\kappa M_n N_p^{-1/N_x}])$ and $\kappa$ is a constant “tuning” parameter (e.g. 
~0.2), $M_n$ is a vector of the maximum difference between particle components 
before roughening, the $n^{th}$-element of M given by (see Sec. 8.4 for application): 

$M_n = \max_{ij} |x_i^{(n)}(t) - x_j^{(n)}(t)|$ for $n = 1, \ldots, N_x$ 

2 Roughening is useful in estimating embedded state-space model parameters and is applied to the joint 
state/parameter estimation problem in Sec. 8.4 to follow. 

FIGURE 7.7 Nonlinear trajectory simulation/estimation for low process noise case: 
(a) Simulated state and mean, (b) Simulated measurement and mean, (c) Bootstrap 
state estimate, (d) Bootstrap measurement estimate, (e) Zero-Mean/Whiteness test 
(0.03 < 0.17/0.78% out), (f) WSSR test (below threshold). 

FIGURE 7.8 Nonlinear trajectory Bayesian estimation comparison for low process noise 
case problem: (a) State estimates: MAP, MMSE, UKF, median, (b) Predicted measurement 
estimates: MAP, MMSE, UKF, median. 

(7.30) 
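
Roughening per Eqs. 7.29 and 7.30 can be sketched as follows. The function name `roughen` is hypothetical, particles are represented as lists of length $N_x$, and the diagonal entry $\kappa M_n N_p^{-1/N_x}$ is read here as a variance, following the $\mathcal{N}(0, \cdot)$ notation above; treating it as a standard deviation instead is an equally plausible reading and only changes the square root.

```python
import math
import random


def roughen(particles, kappa=0.2, rng=None):
    """Add zero-mean Gaussian jitter to each resampled particle (Eqs. 7.29-7.30)."""
    rng = rng or random.Random()
    n_p, n_x = len(particles), len(particles[0])
    # M_n: largest pairwise difference in component n, i.e. max minus min.
    spread = [max(p[n] for p in particles) - min(p[n] for p in particles)
              for n in range(n_x)]
    # Assumption: kappa * M_n * N_p^(-1/N_x) is the variance of component n.
    var = [kappa * m * n_p ** (-1.0 / n_x) for m in spread]
    return [[p[n] + rng.gauss(0.0, math.sqrt(var[n])) for n in range(n_x)]
            for p in particles]


particles = [[0.0, 5.0], [1.0, 5.0], [3.0, 5.0], [2.0, 5.0]]
roughened = roughen(particles, rng=random.Random(3))
```

A component with no spread across the particle set receives no jitter, so the amount of added noise adapts to how far the particles have already collapsed.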












FIGURE 7.9 Nonlinear trajectory Bayesian estimation instantaneous posterior distributions: 
(a) Updated state distribution, $\hat{\Pr}[x(t)|Y_t]$. (b) Predicted measurement distribution, $\hat{\Pr}[y(t)|Y_{t-1}]$. 

(likelihood for bootstrap) distribution. 

Next we consider some alternative approaches that attempt to approximate the 
minimum variance importance distribution more closely for improved performance. 

7.5.2 Auxiliary Particle Filter 

The auxiliary particle filter employing sampling-importance-resampling (ASIR) is 
a variant of the standard SIR [8]. It is based on attempting to mitigate two basic 
weaknesses in particle filtering: (1) poor outlier performance; and (2) poor posterior 
tail performance. These problems evolve from the empirical approximation of the 
filtering posterior which can be considered a mixture distribution. 

The basic concept of ASIR is to mimic the operation of the minimum variance (optimal) 
importance distribution, $q_{MV}(x(t)|x(t-1), y(t))$, by introducing an auxiliary 
variable, $\mathcal{K}$, representing the weight of the mixture used in the empirical prediction 
distribution estimate. The idea is to perform resampling at time (t - 1) using the 
available measurement at time t before the particles $\{x_i(t)\}$ are propagated to time t 
through the transition and likelihood distributions. The key step is to favor particles at 
time (t - 1) that are likely to “survive” (largest weights) at the next time-step, t. The 
problem is that these schemes tend to introduce a bias into the estimated posterior that 
must then be corrected by modifying the weight of the remaining particles. Thus, the 
ASIR is a two-stage process such that: (1) particles with large predictive likelihoods at 
time-step (t - 1) are propagated; and (2) the resulting particles are then re-weighted 
and drawn from the resulting posterior. 

Following the development in Cappe [6], we start with a proposal over the entire 
path $\{X_t\}$ up to time t under the assumption that the joint posterior at time (t - 1) is 
well approximated by a particle representation, $\{W_i(t-1), X_{t-1}(i)\}$. Thus, the joint 
importance proposal for the “new” particles $\{X_t(i)\}$ is 

$q(X_t) = \underbrace{q(X_{t-1}|Y_t)}_{\text{PAST}} \times \underbrace{q(x(t)|x(t-1), y(t))}_{\text{NEW}}$ (7.31) 

Note that the “past” trajectories depend on the data up to time-step t to enable the 
adaption to the new data, y(t), while the “new” conditional importance distribution 
($q \to q_{MV}$) incorporates the new state, x(t). We substitute an empirical distribution 
for $\Pr(X_{t-1}|Y_t)$ centered on the previous particle paths $\{X_{t-1}(i)\}$ 

$q(X_{t-1}|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{K}_i(t-1)\,\delta(X_{t-1} - X_{t-1}(i))$ (7.32) 

where $\sum_{i=1}^{N_p} \mathcal{K}_i(t-1) = 1$ and $\mathcal{K}_i(t-1) > 0$. The $i^{th}$ weight (probability mass) for 
each particle in this proposal is based on the pre-selected particles that are a “good 
fit” to the new data point y(t). One choice [8] for these weights is to choose a point 
estimate of the state such as its mean, 

$\hat{m}_i(t) = \int x(t) \times \Pr(x(t)|x_i(t-1))\,dx(t)$ (7.33) 

and then compute the weighting function as the likelihood evaluated at this point 

$\mathcal{K}_i(t-1) = C(y(t)|\hat{m}_i(t))$ 

as in the bootstrap technique [16] or if the particles $\{x_i(t-1)\}$ are weighted [6], then 

$\mathcal{K}_i(t-1) = W_i(t-1) \times C(y(t)|\hat{m}_i(t))$ 

which follows from the marginal $\Pr(X_{t-1}|Y_t)$, a smoothing distribution. That is, 
using the particle approximation from time (t - 1) and expanding, we obtain 

$\Pr(X_{t-1}|Y_t) \propto \int \Pr(X_{t-1}|Y_{t-1}) \times A(x(t)|x(t-1)) \times C(y(t)|x(t))\,dx(t)$ (7.34) 

and using the empirical distribution approximation for the first term gives 

$\Pr(X_{t-1}|Y_t) \approx \sum_{i=1}^{N_p} W_i(t-1)\,\delta(X_{t-1} - X_{t-1}(i)) \times \int C(y(t)|x(t))\,A(x(t)|x_i(t-1))\,dx(t)$ (7.35) 

One approximation [8], $A(x(t)|x_i(t-1)) \to \delta(x(t) - \hat{m}_i(t))$, leads to the estimator 

$\Pr(X_{t-1}|Y_t) \approx \sum_{i=1}^{N_p} \underbrace{C(y(t)|\hat{m}_i(t)) \times W_i(t-1)}_{\text{combined weight}} \times \delta(X_{t-1} - X_{t-1}(i))$ (7.36) 

giving the desired result above 

$\mathcal{K}_i(t-1) := W_i(t-1) \times C(y(t)|\hat{m}_i(t))$ (7.37) 

Using this proposal, the generalized weight, $W_{aux}(i,t)$, is determined from the ratio of the posterior 

$\Pr(X_t|Y_t) \propto \int C(y(t)|x(t)) \times A(x(t)|x(t-1)) \times \Pr(X_{t-1}|Y_{t-1})\,dx(t)$ 

to the joint proposal giving 

$W_{aux}(i,t) = \frac{\Pr(X_t|Y_t)}{q(X_t)} = \frac{W_i(t-1)}{\mathcal{K}_i(t-1)} \times \frac{C(y(t)|x_i(t)) \times A(x_i(t)|x_i(t-1))}{q(x_i(t)|x_i(t-1), y(t))}$ 

(7.38) 
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TABLE 7.3 Auxiliary SIR State-Space Particle Filtering Algorithm 


Initialize 
    $x_i(0) \sim \Pr(x(0))$, $W_i(0) = 1/N_p$, $i = 1, \ldots, N_p$    [sample] 

Auxiliary Weights 
  Bootstrap Processor: 
    $\hat{m}_i(t) \approx E\{x(t)|x_i(t-1)\}$    [bootstrap mean] 
  Weight calculation 
    $\mathcal{K}_i(t-1) = W_i(t-1) \times C(y(t)|\hat{m}_i(t))$    [bootstrap likelihood] 
  Weight Normalization 
    $\mathcal{K}_i(t-1) = \mathcal{K}_i(t-1) / \sum_{i=1}^{N_p} \mathcal{K}_i(t-1)$    [auxiliary weight] 

Resampling 
    Select indices $\{j(i)\}$ using $\{\mathcal{K}_{j(i)}(t-1)\}$: $\hat{x}_i(t) \Rightarrow x_{j(i)}(t)$    [resample] 
    $\alpha_i(t-1) := W_{j(i)}(t-1) / \mathcal{K}_{j(i)}(t-1)$    [first stage weights] 

Importance Sampling Proposal (Optimal) 
    $x_i(t) \sim q(x_i(t)|x_i(t-1), y(t))$    [sample] 

Weight Update 
    $W_i(t) = \alpha_i(t-1) \times \frac{C(y(t)|x_i(t)) \times A(x_i(t)|x_i(t-1))}{q(x_i(t)|x_i(t-1), y(t))}$    [weight-update] 

Weight Normalization 
    $W_{aux}(i,t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$    [normalize] 

Distribution 
    $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} W_{aux}(i,t)\,\delta(x(t) - x_i(t))$    [posterior distribution] 


which follows directly from substitution of the empirical distributions [6]. Here the 
bias correction, $1/\mathcal{K}_i$, is introduced into the sampler to correct the auxiliary weight 
(first stage). 

We summarize the auxiliary particle filter in Table 7.3. First, the bootstrap technique 
is executed to generate a set of particles and auxiliary weights at time-step 
(t - 1), that is, $\{W_i(t), x_i(t)\}$, which are used to estimate the set of means (or modes) 
$\{\hat{m}_i(t)\}$ as in Eq. 7.33 for the “smoothing” weight calculation of Eq. 7.37 generating 
the probability masses, $\{\mathcal{K}_i(t-1)\}$. These weights are then used in resampling 
to generate the set of “most likely” particles and weights under the new sampling 
indices $\{j(i)\}$, $\{\mathcal{K}_{j(i)}(t-1), x_{j(i)}(t-1)\}$, that is, $x_i(t-1) \Rightarrow \hat{x}_{j(i)}(t-1)$; $i = 1, \ldots, N_p$. 
Next, samples are drawn from the optimal proposal and used to update the 
auxiliary weights using the resampled particles and first stage weights defined by 
$\alpha_i(t-1) := W_i(t-1) / \mathcal{K}_i(t-1)$. The posterior distribution is estimated using the 
auxiliary weights to complete the process. 

If the process is governed by severe nonlinearities or contaminated by high process 
noise, then a single point estimate such as $\hat{m}_i(t)$ does not represent 
the transition probability $\Pr(x(t)|x_i(t-1))$ well. Therefore, we can expect the 
ASIR performance to be poor, even yielding weaker results than that of the bootstrap 
processor. However, if the process noise is small, implying that a point estimate can 
characterize the transition probability reasonably well, then the ASIR is less sensitive 
to “outliers” and the weights will be more uniformly balanced resulting in excellent 
performance compared to the bootstrap. These concepts can be used as an aid to help 
decide when such a technique is applicable to a given problem. 
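
One time-step of the two-stage ASIR procedure of Table 7.3 can be sketched for a scalar model with additive Gaussian noise. Several simplifications are assumptions made here, not part of the text: the transition prior is used as the second-stage proposal instead of the optimal one (so the $A/q$ ratio cancels), the point estimate $\hat{m}_i(t)$ is taken as the noise-free prediction $a[x_i(t-1)]$, and the names `gauss_lik` and `asir_step` are hypothetical.

```python
import math
import random


def gauss_lik(y, mean, var):
    """Gaussian likelihood N(y : mean, var)."""
    return math.exp(-0.5 * (y - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def asir_step(particles, weights, y, a, c, r_ww, r_vv, rng):
    """One ASIR update: first-stage resampling on K_i, then re-weighting."""
    n_p = len(particles)
    means = [a(x) for x in particles]                      # point estimates m_i(t)
    aux = [w * gauss_lik(y, c(m), r_vv) for w, m in zip(weights, means)]  # K_i(t-1)
    total = sum(aux)
    aux = [k / total for k in aux]                         # normalized auxiliary weights
    idx = rng.choices(range(n_p), weights=aux, k=n_p)      # first-stage indices j(i)
    new_particles, new_weights = [], []
    for j in idx:
        # Assumption: propose from the transition prior, so A/q = 1 in the update.
        x_new = means[j] + rng.gauss(0.0, math.sqrt(r_ww))
        alpha = weights[j] / aux[j]                        # first-stage weight W/K
        new_particles.append(x_new)
        new_weights.append(alpha * gauss_lik(y, c(x_new), r_vv))
    total = sum(new_weights)
    return new_particles, [w / total for w in new_weights]


def a(x):
    return (1.0 - 0.05 * 0.01) * x + 0.04 * 0.01 * x * x   # process model, Example 7.1


def c(x):
    return x * x + x ** 3                                   # measurement model, Example 7.1


rng = random.Random(0)
particles = [1.9 + 0.002 * i for i in range(100)]
weights = [0.01] * 100
new_particles, new_weights = asir_step(particles, weights, 12.0, a, c, 1e-6, 0.09, rng)
x_mmse = sum(w * x for w, x in zip(new_weights, new_particles))
```

With small process noise, as here, the point estimate describes the transition well and the second-stage weights come out nearly uniform, which is the regime where the discussion above expects ASIR to do well.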

7.5.3 Regularized Particle Filter 

In order to reduce the degeneracy of the weights in the SIR processor, resampling 
was introduced as a potential solution; however, it was mentioned that even though 
the particles are “steered” to high probability regions, they tend to lose their diversity 
among other problems introduced by such a procedure [1, 44]. This problem occurs 
because samples are drawn from a discrete rather than continuous distribution (see 
Sec. 1.5). Without any attempt to correct this problem, the particles can collapse 
to a single location giving a poor characterization of the posterior distribution and 
therefore result in poor processor performance. 

One solution to the diversity problem is to develop a continuous rather than discrete 
approximation to the empirical posterior distribution using the kernel density 
estimator of Sec. 3.2 and then perform resampling directly from it. This is termed 
a regularization-step resulting in diversification by a form of “jittering” the particles; 
thus, the processor is called the regularized particle filter (RPF). The key 
idea of the RPF is the transformation of the discrete empirical posterior distribution, 
$\hat{\Pr}(x(t)|Y_t) \Rightarrow \Pr(x_t|Y_t)$, in order to resample from an absolutely continuous distribution 
producing a “new” set of $N_p$-particles with different locations. 

To be more precise let us define the properties of a kernel that can be applied 
to this problem [7]. A regularization kernel $\mathcal{K}(x)$ is a symmetric probability 
density function such that: (1) $\mathcal{K}(x) \ge 0$; (2) $\int \mathcal{K}(x)\,dx = 1$; (3) $\int x\,\mathcal{K}(x)\,dx = 0$; 
and (4) $\int \|x\|^2 \mathcal{K}(x)\,dx < \infty$ and for any positive bandwidth $\Delta_x$ the corresponding 
rescaled kernel is defined by 

$\mathcal{K}_{\Delta_x}(x) = \left( \frac{1}{\Delta_x} \right)^{N_x} \mathcal{K}\left( \frac{x}{\Delta_x} \right)$ for $x \in \mathcal{R}^{N_x \times 1}$ (7.44) 

The most important property of the kernel density follows from its regularization 
property, that is, for any distribution $\mathcal{P}(x) \in \mathcal{R}^{N_x \times 1}$, the regularization results 
in an absolutely continuous probability distribution, $\mathcal{K}_{\Delta_x}(x) * \mathcal{P}(x)$, with * the 
convolution operator, such that 

$\frac{d}{dx}\left[ \mathcal{K}_{\Delta_x}(x) * \mathcal{P}(x) \right] = \int \mathcal{K}_{\Delta_x}(x - \alpha)\,\mathcal{P}(\alpha)\,d\alpha$ (7.45) 

If $\mathcal{P}$ is an empirical distribution, then 

$\mathcal{P}(x) \approx \sum_{i=1}^{N_p} W_i\,\delta(x - x_i)$ 

for $x_i \to \{x_1, \ldots, x_{N_p}\}$ a sample from the posterior and therefore 

$\frac{d}{dx}\left[ \mathcal{K}_{\Delta_x}(x) * \mathcal{P}(x) \right] = \left( \frac{1}{\Delta_x} \right)^{N_x} \sum_{i=1}^{N_p} W_i\,\mathcal{K}\left( \frac{x - x_i}{\Delta_x} \right) = \sum_{i=1}^{N_p} W_i\,\mathcal{K}_{\Delta_x}(x - x_i)$ (7.46) 

Both the bandwidth and the kernel are selected in practice to minimize the mean 
integrated error between the posterior and regularized distribution. Classical kernels 
result under specialized assumptions such as the Epanechnikov, Box, Triangle, 
Gaussian etc. (see [7], Chapter 12 for more details). 

One of the underlying assumptions of this transformation is that the true posterior 
$\Pr(x_t|Y_t)$ has a unity covariance which is not the case when implementing the RPF 
technique. Therefore, at each time-step we must estimate the ensemble mean and 
covariance by the usual sample approach given by 

$\hat{m}(t) = \frac{1}{N_p} \sum_{i=1}^{N_p} x_i(t)$ 

$\hat{R}_{xx}(t) = \frac{1}{N_p} \sum_{i=1}^{N_p} (x_i(t) - \hat{m}(t))(x_i(t) - \hat{m}(t))'$ (7.47) 

With this calculation, we factor the covariance using the Cholesky decomposition to 
yield the matrix square roots used in a whitening transformation (unity covariance), 
that is, 

$\hat{R}_{xx}(t) = L^{1/2}(t)\,L^{T/2}(t)$ 

which leads to the new scaled kernel 

$\mathcal{K}_{\Delta_x}(x) = \frac{1}{|L^{1/2}|\,(\Delta_x)^{N_x}}\,\mathcal{K}\left( \frac{L^{-1/2}x}{\Delta_x} \right)$ (7.48) 

The old particles are then “jittered” by using the step 

$\tilde{x}_i(t) = \hat{x}_i(t) + \Delta_x L^{1/2}(t)\,\epsilon_i(t)$ (7.49) 

where the $\{\epsilon_i(t)\}$ are drawn from the new scaled kernel above. A typical example of 
kernel density estimation of a discrete probability mass function is shown in Fig. 7.1 
where the circles represent the discrete point masses (impulses) at the particular 
location and the continuous approximation to the probability density is shown by the 
smooth curve provided by the kernel density estimate using a Gaussian window. This 
completes the RPF technique which is summarized in Table 7.4. Next we discuss 
another popular approach to produce particle diversity. 
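
The jitter step of Eq. 7.49 can be sketched for a scalar state, where the Cholesky factor reduces to the sample standard deviation. Several choices here are assumptions rather than part of the text: a unit-variance Gaussian kernel, sample statistics computed from the already resampled (equally weighted) set, and a default bandwidth from the commonly quoted Gaussian-kernel rule $\Delta_x = (4/(N_x+2))^{1/(N_x+4)} N_p^{-1/(N_x+4)}$ with $N_x = 1$. The function name `regularize` is hypothetical.

```python
import math
import random


def regularize(resampled, delta_x=None, rng=None):
    """Jitter resampled scalar particles: x_tilde = x_hat + delta_x * L^(1/2) * eps."""
    rng = rng or random.Random()
    n_p = len(resampled)
    m = sum(resampled) / n_p                                # sample mean, Eq. 7.47
    r_xx = sum((x - m) ** 2 for x in resampled) / n_p       # sample covariance, Eq. 7.47
    l_half = math.sqrt(r_xx)                                # scalar "Cholesky" factor
    if delta_x is None:
        # Assumption: rule-of-thumb bandwidth for a Gaussian kernel with N_x = 1.
        delta_x = (4.0 / 3.0) ** 0.2 * n_p ** (-0.2)
    # Assumption: eps_i drawn from a unit-variance Gaussian kernel.
    return [x + delta_x * l_half * rng.gauss(0.0, 1.0) for x in resampled]


resampled = [1.0] * 5 + [2.0] * 5          # heavily replicated set after resampling
jittered = regularize(resampled, rng=random.Random(7))
collapsed = regularize([3.0] * 4, rng=random.Random(1))
```

The replicated particles are spread to distinct locations, but a set that has already collapsed to a single point has zero sample covariance and is left unchanged, so regularization must be applied before diversity is lost entirely.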

7.5.4 MCMC Particle Filter 

Another approach to increase the diversity in the particle set $\{x_i(t), W_i(t)\}$; $i = 1, \ldots, N_p$ 
is to take an MCMC step(s) with the underlying invariant distribution targeted as the 
posterior $\Pr(X_t|Y_t)$ on each particle [45]. The MCMC particle filter is available in two 
varieties: (1) MCMC-step(s) with the usual BSP; and (2) full MCMC iterative filter. 
The sequential processors use the MCMC-steps as part of the resampling process for 
especially insensitive (weight divergence) problems, while the full MCMC iterative 
processor is available as a separate algorithm typically executed using the Metropolis, 
Metropolis-Hastings or Gibbs samplers of Chapter 3. We confine our discussion to 
the MCMC-step approach, since we are primarily interested in sequential techniques 
and refer the interested reader to [1, 23, 24] for the iterative approach. 

The main idea is that if the particles are distributed as $\Pr(X_t(i)|Y_t)$, then applying a 
Markov chain transition kernel defined by 

$T(X_t|X_t(i)) := \Pr(X_t|X_t(i))$ (7.50) 

with posterior invariant distribution such that 

$\Pr(X_t|Y_t) = \int T(X_t|X_t(i)) \times \Pr(X_t(i)|Y_t)\,dX_t(i)$ (7.51) 

continues to result in a particle set with the desired posterior as its invariant distribution. 
However, the new particle locations after the move lie in high probability 
regions of the state-space. It has been shown that applying the MCMC transition 
kernel can only decrease the total variance of the current distribution [46]. Any 
of the MCMC methods (M-H, G-S, S-S, etc.) can be incorporated into the SMC 
framework to achieve the desired move occurring after the resampling operation. 

Following [46, 47] the objective is to move the set of particles using a combination 
of importance sampling, resampling and MCMC sampling, that is, the approach is to: 

• Initialize the set of particles yielding: $\{x_i(t)\}$; $i = 1, \ldots, N_p$ 

• Resample this set to obtain: $\{\hat{x}_i(t)\}$; $i = 1, \ldots, N_p$ 

• Move using an MCMC step(s) to generate the “new” set of particles 
$\{\tilde{x}_i(t)\}$; $i = 1, \ldots, N_p$ with $\tilde{x}_i(t) \sim \Pr(X_t|Y_t)$ and transition kernel, $T(\tilde{x}_i(t)|X_t(i))$ 

The move-step performs one or more iterations of an MCMC technique on each 
selected particle after the resampling step with invariant distribution $\Pr(X_t|Y_t)$. Note 
that before the move, the resampled particles are distributed $\hat{x}_i(t) \sim \Pr(X_t|Y_t)$; therefore, 
the “moved” particles, $\tilde{x}_i(t)$, are approximately distributed by this posterior as 
well. The move-step improves particle diversity by enriching the particle locations to 
those highest probability regions. 
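
A move-step can be sketched with a random-walk Metropolis-Hastings kernel applied to each resampled scalar particle. This is an illustration under stated assumptions: `log_target` is a hypothetical user-supplied function returning the unnormalized log posterior at a particle location, the proposal is a symmetric Gaussian random walk (so the proposal ratio cancels), and the names `mh_move` and `step_std` are invented for the example.

```python
import math
import random


def mh_move(resampled, log_target, step_std, rng, n_iter=1):
    """Apply n_iter random-walk Metropolis-Hastings moves to every particle."""
    moved = list(resampled)
    accepted = 0
    for _ in range(n_iter):
        for i, x_hat in enumerate(moved):
            x_tilde = x_hat + rng.gauss(0.0, step_std)      # symmetric random-walk proposal
            log_ratio = log_target(x_tilde) - log_target(x_hat)
            # Accept with probability min(1, target ratio); otherwise keep the particle.
            if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
                moved[i] = x_tilde
                accepted += 1
    return moved, accepted


def log_target(x):
    """Assumed target: Gaussian likelihood of y = 12 under c[x] = x^2 + x^3, R_vv = 0.09."""
    return -0.5 * (12.0 - (x * x + x ** 3)) ** 2 / 0.09


resampled = [2.0] * 20                      # twenty replicas of one high-weight particle
moved, accepted = mh_move(resampled, log_target, 0.01, random.Random(5), n_iter=3)
```

The replicas start at the same location and end at distinct ones, while the accept/reject test keeps them within the region supported by the target.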
TABLE 7.4 Regularized Particle Filtering Algorithm

Initialize
  Draw: $x_i(0) \sim \Pr(x(0))$, $W_i(0) = \frac{1}{N_p}$, $i = 1, \ldots, N_p$   [sample]

Importance Sampling
  Draw: $x_i(t) \sim \mathcal{A}(x(t)|x_i(t-1))$   [state transition]

Weight Update
  $W_i(t) = \mathcal{C}(y(t)|x_i(t))$   [weight/likelihood]

Weight Normalization
  $\mathcal{W}_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$

Resampling Decision
  $\hat{N}_{eff} = 1 / \sum_{i=1}^{N_p} \mathcal{W}_i^2(t)$   [effective samples]
  $\hat{N}_{eff} \le N_{thresh}$: Resample; $\hat{N}_{eff} > N_{thresh}$: Accept   [decision]

Regularization Sample Statistics
  $m(t) = \frac{1}{N_p} \sum_{i=1}^{N_p} x_i(t)$   [sample mean]
  $R_{xx}(t) = \frac{1}{N_p} \sum_{i=1}^{N_p} (x_i(t) - m(t))(x_i(t) - m(t))'$   [sample covariance]

Factorization
  $R_{xx}(t) = L^{1/2}(t) L^{T/2}(t)$   [Cholesky decomposition]

Resampling
  $\hat{x}_i(t) \Rightarrow x_i(t)$

Diversification
  Draw: $\epsilon_i(t) \sim \mathcal{K}_{\Delta x}(x(t) - \hat{x}_i(t))$   [sample]

Diversify
  $\tilde{x}_i(t) = \hat{x}_i(t) + \Delta x\, L^{1/2}(t)\, \epsilon_i(t)$   [generate sample]

Distribution
  $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\, \delta(x(t) - x_i(t))$   [posterior distribution]
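The resampling decision and diversification steps of Table 7.4 can be sketched for a scalar state as follows. This is only an illustration under stated assumptions: the kernel is taken to be Gaussian, the bandwidth `delta_x` and the threshold of half the particle count are hypothetical choices, and the scalar square root of the sample variance stands in for the Cholesky factor.

```python
import math
import random


def effective_sample_size(weights):
    """N_eff = 1 / sum_i W_i^2, for weights that already sum to one."""
    return 1.0 / sum(w * w for w in weights)


def regularized_resample(particles, weights, delta_x, rng):
    """Resample by weight, then diversify each copy as x + delta_x * L * eps."""
    n = len(particles)
    mean = sum(particles) / n                           # sample mean m(t)
    var = sum((x - mean) ** 2 for x in particles) / n   # sample covariance R_xx(t)
    chol = math.sqrt(var)                               # scalar "Cholesky" factor
    resampled = rng.choices(particles, weights=weights, k=n)
    # Gaussian kernel draw: an assumption, other kernels are possible.
    return [x + delta_x * chol * rng.gauss(0.0, 1.0) for x in resampled]


rng = random.Random(0)
particles = [-1.0, 0.2, 0.5, 3.0]
weights = [0.05, 0.05, 0.85, 0.05]

# Hypothetical threshold: resample when N_eff drops below half the particle count.
if effective_sample_size(weights) < 0.5 * len(particles):
    particles = regularized_resample(particles, weights, delta_x=0.3, rng=rng)
    weights = [1.0 / len(particles)] * len(particles)
```

Because every resampled copy receives its own kernel draw, replicated particles end up at distinct locations, which is the purpose of the regularization.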
For instance, let us “track” the $i$th particle with corresponding high valued weight.
Based on its associated weight (probability), $\hat{x}_i(t) \Rightarrow x_i(t)$ is selected according to

$$\Pr(\hat{x}_i(t) = x_i(t)) = \mathcal{W}_i(t)$$

Resampling results in replication of the $i$th particle, $N_i$ times, producing the
resampled set, $\{\hat{x}_{i1}(t), \hat{x}_{i2}(t), \ldots, \hat{x}_{iN_i}(t)\}$. Next the move-step moves each replicant
$\hat{x}_{ij}(t) \rightarrow \tilde{x}_{ij}(t)$ to a distinct (unique) location in the region of strongest support dictated
by $\Pr(X_t|Y_t)$. This provides the “move-step” for the SMC technique and mitigates the
divergence problem.

To be more specific, let us illustrate the “move” by choosing the Metropolis- 
Hastings technique of Table 3.1 to perform our MCMC -step using the random walk 
M-H approach of Sec. 3.4. 

We start with the basic bootstrap PF of Sec. 7.5 to obtain the set of resampled
particles $\{\hat{x}_i(t)\}$. Using the random walk model, we perform the “move” step to obtain
the new set of particles as

$$\tilde{x}_i(t) = \hat{x}_i(t) + \epsilon_i(t) \quad \text{for } \epsilon_i \sim p_E(\epsilon) \tag{7.52}$$


One choice is $\epsilon_i \sim p_E(\epsilon) = \mathcal{N}(0, R_{\epsilon\epsilon})$, since it is symmetric, easy to generate and
will simplify the calculations even further. The corresponding acceptance probability
for the M-H approach is specified by

$$A(\tilde{x}_i(t), \hat{x}_i(t)) = \min\left\{ \frac{\Pr(\tilde{X}_t(i)|Y_t)}{\Pr(\hat{X}_t(i)|Y_t)} \times \frac{q(\hat{x}_i(t)|\tilde{x}_i(t))}{q(\tilde{x}_i(t)|\hat{x}_i(t))},\ 1 \right\} \tag{7.53}$$

where $\tilde{X}_t(i) := \{\tilde{x}_i(t), X_{t-1}(i)\}$ and $\hat{X}_t(i) := \{\hat{x}_i(t), X_{t-1}(i)\}$ are the augmented sets of
joint random particles. Drawing samples from the random walk with (symmetric)
Gaussian distribution enables us to simplify the acceptance probability, since
$q(\hat{x}_i(t)|\tilde{x}_i(t)) = q(\tilde{x}_i(t)|\hat{x}_i(t))$, canceling these terms in Eq. 7.53 to produce

$$A(\tilde{x}_i(t), \hat{x}_i(t)) = \min\left\{ \frac{\Pr(\tilde{X}_t(i)|Y_t)}{\Pr(\hat{X}_t(i)|Y_t)},\ 1 \right\} \tag{7.54}$$


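But from the sequential Bayesian recursion,

$$\Pr(X_t|Y_t) \propto \Pr(y(t)|x(t))\, \Pr(x(t)|x(t-1)) \times \Pr(X_{t-1}|Y_{t-1})$$

we obtain

$$A(\tilde{x}_i(t), \hat{x}_i(t)) = \min\left\{ \frac{\Pr(y(t)|\tilde{x}_i(t)) \times \Pr(\tilde{x}_i(t)|x_i(t-1))}{\Pr(y(t)|\hat{x}_i(t)) \times \Pr(\hat{x}_i(t)|x_i(t-1))},\ 1 \right\} \tag{7.55}$$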


With this information in mind, the implementation of the bootstrap PF with (random
walk) MCMC-step is given in Table 7.5. Another approach to improve particle
diversity is using local linearization techniques which can be implemented with any
of the classical/modern algorithms of the previous two chapters.





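TABLE 7.5 MCMC Particle Filtering Algorithm

Initialize
  Draw: $x_i(0) \sim \Pr(x(0))$, $W_i(0) = \frac{1}{N_p}$, $i = 1, \ldots, N_p$   [sample]

Importance Sampling
  Draw: $x_i(t) \sim \mathcal{A}(x(t)|x_i(t-1))$   [state transition]

Weight Update
  $W_i(t) = \mathcal{C}(y(t)|x_i(t))$   [weight/likelihood]

Weight Normalization
  $\mathcal{W}_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$

Resampling Decision
  $\hat{N}_{eff} = 1 / \sum_{i=1}^{N_p} \mathcal{W}_i^2(t)$   [effective samples]
  $\hat{N}_{eff} \le N_{thresh}$: Resample; $\hat{N}_{eff} > N_{thresh}$: Accept   [decision]

Resampling
  $\hat{x}_i(t) \Rightarrow x_i(t)$

Diversification
  Acceptance Probability
  $A(\tilde{x}_i(t), \hat{x}_i(t)) = \min\left\{ \frac{\mathcal{C}(y(t)|\tilde{x}_i(t)) \times \mathcal{A}(\tilde{x}_i(t)|x_i(t-1))}{\mathcal{C}(y(t)|\hat{x}_i(t)) \times \mathcal{A}(\hat{x}_i(t)|x_i(t-1))},\ 1 \right\}$
  Draw: $\epsilon_i(t) \sim \mathcal{N}(0, R_{\epsilon\epsilon})$   [sample]

Diversify
  $\tilde{x}_i(t) = \hat{x}_i(t) + \epsilon_i(t)$   [generate sample]
  Draw: $u_k \rightarrow \mathcal{U}(0,1)$   [uniform sample]

Decision
  $x_i(t) = \tilde{x}_i(t)$ if $u_k < A(\tilde{x}_i, \hat{x}_i)$; $x_i(t) = \hat{x}_i(t)$ otherwise

Distribution
  $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\, \delta(x(t) - x_i(t))$   [posterior distribution]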
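To make the diversification steps of Table 7.5 concrete, the following sketch applies one random walk M-H move to a single resampled particle. The scalar Gauss-Markov model (coefficient `a`, process variance `q`, measurement variance `r`) and the step variance `step_var` are hypothetical choices made only for illustration; the acceptance ratio is evaluated in logarithms to avoid underflow.

```python
import math
import random


def log_gauss(z, mean, var):
    """Log of the Gaussian density N(mean, var) evaluated at z."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)


def mh_move(x_hat, x_prev, y, rng, step_var=0.1, a=0.9, q=1.0, r=0.5):
    """One random walk M-H move of a resampled particle x_hat.

    Assumed (hypothetical) model: x(t) = a x(t-1) + w, y(t) = x(t) + v,
    with w ~ N(0, q) and v ~ N(0, r).
    """
    x_tilde = x_hat + rng.gauss(0.0, math.sqrt(step_var))   # random walk proposal
    # Log of likelihood x transition prior for the proposed and current particle.
    log_num = log_gauss(y, x_tilde, r) + log_gauss(x_tilde, a * x_prev, q)
    log_den = log_gauss(y, x_hat, r) + log_gauss(x_hat, a * x_prev, q)
    accept = math.exp(min(0.0, log_num - log_den))          # min{ratio, 1}
    return x_tilde if rng.random() < accept else x_hat


# Usage: 200 replicants of the same resampled particle are moved independently.
rng = random.Random(1)
moved = [mh_move(1.0, 1.0, 1.2, rng) for _ in range(200)]
```

Accepted moves place the replicants at distinct locations, while rejected moves leave the replicant at its resampled value.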
7.5.5 Linearized Particle Filter

Attempts to approximate the minimum variance importance proposal distribution of
Sec. 7.2 continue to evolve [24]. The motivation for this is based on inadequacies
created by selecting the transition prior of Sec. 7.5.1 as the proposal leading to the
popular bootstrap algorithm [16] of Table 7.2. As mentioned previously, the bootstrap
approach requires resampling to mitigate the particle depletion problem and does not
incorporate the most recent measurement in the proposal. These reasons have
led to the development of the PF that incorporates the latest measurement sample.
One way to do so is to generate an approximately Gaussian importance proposal based
on linearization methods [4, 7, 9] with the idea of selecting

$$\Pr(x(t)|X_{t-1}, Y_t) \rightarrow q_{\mathcal{N}}(x(t)|Y_t) \tag{7.56}$$

as a Gaussian proposal. This approach is used to provide coverage of the actual 
posterior due to its long-tailed distribution while incorporating the latest available 
measurement. This Gaussian proposal is the result of marginalizing the prior state,
$x(t-1)$, of the minimum variance proposal. That is,

$$q_{\mathcal{N}}(x(t)|Y_t) \rightarrow \Pr(x(t)|Y_t) = \int \Pr(x(t)|x(t-1), y(t)) \times \Pr(x(t-1)|Y_{t-1})\, dx(t-1) \tag{7.57}$$

In a sense this implicit marginalization effectively averages the proposal with
respect to the previous posterior, $\Pr(x(t-1)|Y_{t-1})$, which incorporates all of the
“information” about $x(t-1)$. Thus, we see that choosing the Gaussian importance
distribution as our proposal enables us to implicitly incorporate all of the knowledge
available about the previous state as well as incorporate the current measurement,
$y(t)$. These features make $q_{\mathcal{N}}(x(t)|Y_t)$ a reasonable choice as long as it provides the
overlapping support or coverage of the desired posterior [9].

One of the linearization approaches to implementing the minimum variance
importance function, $q_{MV}(x(t)|x(t-1), y(t))$, is to estimate it as a Gaussian prior,
that is,

$$q_{MV}(x(t)|x(t-1), y(t)) \sim \mathcal{N}(\hat{x}(t|t), \tilde{P}(t|t)) \tag{7.58}$$

where the associated (filtered) conditional mean, $\hat{x}(t|t) = E\{x(t)|Y_t\}$, and error
covariance, $\tilde{P}(t|t) = E\{\tilde{x}(t|t)\tilde{x}'(t|t)\}$ for $\tilde{x}(t|t) = x(t) - \hat{x}(t|t)$, are obtained from an additional
estimation scheme.

There are a variety of choices available, each evolving from either the classical
(linearized, extended or iterated Kalman filters) or the modern unscented (sigma-point)
Kalman filter. Further alternatives are also proposed to these approaches [4],
but we will confine our discussion to these popular and readily available approaches
[9]. In each of these implementations a linearization occurs, either of the nonlinear
dynamics and measurement models in the classical case or a statistical linearization
(unscented transformation) in the unscented or sigma-point case. All of
these “linearized-based” processors provide the updated or filtered conditional mean
estimate

$$\hat{x}(t|t) = \hat{x}(t|t-1) + K(x^*(t))\, e(t) \tag{7.59}$$

where

$\hat{x}(t|t-1)$ is the predicted conditional mean, $E\{x(t)|Y_{t-1}\}$;
$K(x^*(t))$ is the gain or weight based on the particular linearization technique; and
$e(t)$ is the innovation sequence.

Here the choices are:

$x^*(t) \rightarrow x_o(t)$ is a reference trajectory in the linearized case; or
$x^*(t) \rightarrow \hat{x}(t|\alpha)$ is a state estimate in the extended or iterated cases; and
$x^*(t) \rightarrow \chi_i(t|t)$ is a sigma-point in the unscented case.

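The error covariance is much different in each case, that is,

$$\tilde{P}(t|t) = \left(I - K(x^*(t))\, C[x^*(t)]\right) \tilde{P}(t|t-1) \tag{7.60}$$

in the linearized and extended cases with measurement Jacobian

$$C[x^*(t)] = \left.\frac{\partial c_i[x]}{\partial x_j}\right|_{x = x^*(t)} \tag{7.61}$$

and, of course,

$$\tilde{P}(t|t-1) = A[x^*(t)]\, \tilde{P}(t-1|t-1)\, A'[x^*(t)] + R_{ww}(t-1) \tag{7.62}$$

for the system Jacobian $A[x^*(t)] = \left.\frac{\partial a_i[x]}{\partial x_j}\right|_{x = x^*(t)}$.

In the sigma-point (unscented) case, we have

$$\tilde{P}(t|t) = \tilde{P}(t|t-1) - K(x^*(t))\, R_{ee}(t)\, K'(x^*(t)) \tag{7.63}$$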

So we see that depending on the linearization method or the particular classical 
or unscented processor we select, we will generate the Gaussian prior at each time- 
step which is used to obtain the minimum variance importance distribution. Thus, 
we see why they are called the class of “local” linearization-based particle filters. 
The linearized particle filter algorithm is shown in Table 7.6 where we see that the 
conditional mean and covariance are estimated in each time-step of the algorithm and 
particles are drawn from 

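$$x_i(t) \rightarrow \Pr(x(t)|Y_t) \sim \mathcal{N}(\hat{x}(t|t), \tilde{P}(t|t)) \tag{7.64}$$

the updated “Gaussian” estimates and the weights also follow the importance
distribution of Eq. 7.58 given by

$$W_i(t) = \frac{\mathcal{C}(y(t)|x_i(t)) \times \mathcal{A}(x(t)|x_i(t-1))}{q_{MV}(x(t)|x_i(t-1), y(t))} \tag{7.65}$$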
TABLE 7.6 Linearized Particle Filtering Algorithm

Initialize
  Draw: $x_i(0) \sim \Pr(x(0))$, $W_i(0) = \frac{1}{N_p}$, $i = 1, \ldots, N_p$   [sample]

Linearization
  LZKF/EKF/IEKF/UKF Processor

Importance Sampling
  Draw: $x_i(t) \sim \mathcal{N}(\hat{x}(t|t), \tilde{P}(t|t))$   [state draw]

Weight Update
  $W_i(t) = \frac{\mathcal{C}(y(t)|x_i(t)) \times \mathcal{A}(x(t)|x_i(t-1))}{q_{MV}(x(t)|x_i(t-1), y(t))}$   [MV weight]
  for $q_{MV}(x(t)|x_i(t-1), y(t)) = \mathcal{N}(\hat{x}_i(t|t), \tilde{P}_i(t|t))$

Weight Normalization
  $\mathcal{W}_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$

Resampling Decision
  $\hat{N}_{eff} = 1 / \sum_{i=1}^{N_p} \mathcal{W}_i^2(t)$   [effective samples]
  $\hat{N}_{eff} \le N_{thresh}$: Resample; $\hat{N}_{eff} > N_{thresh}$: Accept   [decision]

Resampling
  $\hat{x}_i(t) \Rightarrow x_i(t)$

Distribution
  $\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\, \delta(x(t) - x_i(t))$   [posterior distribution]
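One time-step of the importance sampling and weight update in Table 7.6 can be sketched as below, using a scalar extended Kalman update as the linearization. The model functions `a` and `c`, the noise variances `q` and `r`, and the choice of running one EKF update per particle are all assumptions made for illustration; they are not prescribed by the table.

```python
import math
import random


def log_gauss(z, mean, var):
    """Log of the Gaussian density N(mean, var) evaluated at z."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (z - mean) ** 2 / var)


def a(x):
    """Hypothetical state transition function."""
    return 0.9 * x


def c(x):
    """Hypothetical nonlinear measurement function."""
    return x * x / 20.0


def ekf_proposal(x_prev, y, q, r):
    """Scalar EKF update started from one particle; returns (mean, variance)."""
    x_pred = a(x_prev)
    p_pred = q                                  # predicted variance given x_prev
    c_jac = x_pred / 10.0                       # dc/dx evaluated at x_pred
    r_ee = c_jac * p_pred * c_jac + r           # innovation variance
    gain = p_pred * c_jac / r_ee
    x_filt = x_pred + gain * (y - c(x_pred))    # same form as Eq. 7.59
    p_filt = (1.0 - gain * c_jac) * p_pred      # same form as Eq. 7.60
    return x_filt, p_filt


def linearized_pf_step(prev_particles, y, q, r, rng):
    """Draw from the Gaussian proposal and form normalized weights (Eq. 7.65)."""
    particles, log_w = [], []
    for x_prev in prev_particles:
        mean, var = ekf_proposal(x_prev, y, q, r)
        x = rng.gauss(mean, math.sqrt(var))
        particles.append(x)
        log_w.append(log_gauss(y, c(x), r)            # likelihood
                     + log_gauss(x, a(x_prev), q)     # transition prior
                     - log_gauss(x, mean, var))       # Gaussian proposal
    peak = max(log_w)                                 # stabilize the exponentials
    w = [math.exp(v - peak) for v in log_w]
    total = sum(w)
    return particles, [v / total for v in w]


rng = random.Random(2)
previous = [rng.gauss(5.0, 1.0) for _ in range(50)]
particles, weights = linearized_pf_step(previous, y=1.3, q=1.0, r=0.1, rng=rng)
```

Working with log-weights and subtracting the largest value before exponentiating is a numerical convenience, not part of the algorithm itself.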


This completes the linearized particle filters. Next we consider some of the practical
considerations for design.


7.6 PRACTICAL ASPECTS OF PARTICLE FILTER DESIGN 

Monte Carlo methods, even those that are model-based, are specifically aimed at
providing a reasonable estimate of the underlying posterior distribution; therefore,
performance testing typically involves estimating just “how close” the estimated
posterior is to the “true” posterior. However, within a Bayesian framework, this
comparison of posteriors only provides a measure of relative performance but does
not indicate just how well the underlying model embedded in the processor “fits”
the measured data. Nevertheless, these closeness methods are usually based on the
Kullback-Leibler (KL) divergence measure [58] providing such an answer. However,
in many cases we do not know the true posterior and therefore we must resort to
other means of assessing performance such as evaluating mean-squared error (MSE)
or more generally checking the validity of the model by evaluating samples generated
by the prediction or the likelihood cumulative distribution to determine whether or
not the resulting sequences have evolved from a uniform distribution and are i.i.d.
(see Sec. 3.3), analogous to a whiteness test for Gaussian sequences.

Thus, PF are essentially sequential estimators of the posterior distribution employing
a variety of embedded models to achieve meaningful estimates. In contrast to the
BP designs which are typically based on Gaussian assumptions, the PF have no such
constraints per se. In the linear case, a necessary and sufficient condition for optimality
of the linear BP is that the corresponding innovations or residual sequence
must be zero-mean and white (see Sec. 5.6 for details). In light of this constraint, a
variety of statistical tests (whiteness, uncorrelated inputs, etc.) were developed in Sec.
5.6 evolving from this known property. When the linear Bayesian processors were
“extended” to the nonlinear case, the same tests were performed based on approximate
Gaussian assumptions. Clearly, when noise is additive Gaussian these arguments
can still be applied. These same statistical tests can also be performed based on the
innovations or residual sequences resulting from the PF estimates (MAP, ML,
MMSE) inferred from the estimated posterior distribution. However, some other more
meaningful performance tests for the PF can also be applied for improved design and
performance evaluation.


7.6.1 Posterior Probability Validation 

Much effort has been devoted to the validation problem with the most significant 
results evolving from the information theoretical point of view [59]. Following this 
approach, we start with the basic ideas and quickly converge to a reasonable solution 
to the distribution validation problem [59, 60]. 

Let us first define some concepts about probabilistic information necessary for the
development. These concepts are applied extensively in communications problems
and will prove useful in designing parametric signal processors. The information
(self) contained in the occurrence of the event $\omega_i$ such that $X(\omega_i) = x_i$, is

$$\mathcal{I}(x_i) = -\log_b \Pr(X(\omega_i) = x_i) = -\log_b \Pr(x_i) \tag{7.66}$$

where $b$ is the base of the logarithm which results in different units for information
measures (e.g., base 2 $\rightarrow$ bits, while base $e$ $\rightarrow$ nats). The entropy or average
information is defined by

$$\mathcal{H}(x_i) := E_X\{\mathcal{I}(x_i)\} = \sum_{i=1}^{N} \mathcal{I}(x_i)\Pr(x_i) = -\sum_{i=1}^{N} \Pr(x_i)\log_b \Pr(x_i) \tag{7.67}$$


Mutual information is defined in terms of the information available in the occurrence
of the event $Y(\omega_j) = y_j$ about the event $X(\omega_i) = x_i$ or

$$\mathcal{I}(x_i; y_j) = \log_b \frac{\Pr(x_i|y_j)}{\Pr(x_i)} = \log_b \Pr(x_i|y_j) - \log_b \Pr(x_i) \tag{7.68}$$
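As a small numerical illustration of Eqs. 7.66-7.68, the sketch below evaluates self-information, entropy and mutual information for discrete probabilities. The function names and the default base of 2 (bits) are choices made here for illustration only.

```python
import math


def self_information(p, base=2.0):
    """I(x) = -log_b Pr(x), Eq. 7.66."""
    return -math.log(p, base)


def entropy(probs, base=2.0):
    """H = -sum_i Pr(x_i) log_b Pr(x_i), Eq. 7.67 (zero-probability terms skipped)."""
    return sum(p * self_information(p, base) for p in probs if p > 0.0)


def mutual_information(p_x_given_y, p_x, base=2.0):
    """I(x; y) = log_b Pr(x|y) - log_b Pr(x), Eq. 7.68."""
    return math.log(p_x_given_y, base) - math.log(p_x, base)


fair_coin_bits = entropy([0.5, 0.5])          # one bit of average information
biased_coin_bits = entropy([0.9, 0.1])        # less uncertain, so smaller entropy
no_gain = mutual_information(0.5, 0.5)        # observing y tells us nothing about x
```

A fair coin carries more average information than a biased one, and the mutual information vanishes when conditioning on $y_j$ leaves the probability of $x_i$ unchanged.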

Now using these concepts, we take the information theoretic approach to distri¬ 
bution estimation following [59, 60]. Since many processors are expressed in terms 
of their “estimated” probability distributions, quality or “goodness” can be evalu¬ 
ated by its similarity to the true underlying probability distribution generating the 
measured data. 

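Suppose $\Pr(x_i)$ is the true discrete posterior probability distribution and $\hat{\Pr}(x_i)$ is
the estimated distribution. Then the Kullback-Leibler Information (KL) quantity of
the true distribution relative to the estimated is defined by using

$$\mathcal{I}_{KL}(\Pr(x_i); \hat{\Pr}(x_i)) := E_X\left\{\ln \frac{\Pr(x_i)}{\hat{\Pr}(x_i)}\right\} = \sum_{i=1}^{N} \Pr(x_i) \ln \frac{\Pr(x_i)}{\hat{\Pr}(x_i)}$$

$$= \sum_{i=1}^{N} \Pr(x_i) \ln \Pr(x_i) - \sum_{i=1}^{N} \Pr(x_i) \ln \hat{\Pr}(x_i) \tag{7.69}$$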

where we chose $\log_b = \ln$. The KL possesses some very interesting properties which
we state without proof (see [60] for details) such as

1. $\mathcal{I}_{KL}(\Pr(x_i); \hat{\Pr}(x_i)) \ge 0$

2. $\mathcal{I}_{KL}(\Pr(x_i); \hat{\Pr}(x_i)) = 0 \Leftrightarrow \Pr(x_i) = \hat{\Pr}(x_i)\ \forall i$

3. The negative of the KL is the entropy, $\mathcal{H}_{KL}(\Pr(x_i); \hat{\Pr}(x_i))$

The second property implies that as the estimated posterior distribution approaches 
the true distribution, then the value of the KL approaches zero (minimum). Thus, 
investigating Eq. 7.69, we see that the first term is a constant specified by the true 
distribution; therefore, we only need to estimate the average value of the estimated 
posterior relative to the true distribution, that is,

$$\mathcal{L}(x_i) := E_X\{\ln \hat{\Pr}(x_i)\} = \sum_{i=1}^{N} \Pr(x_i) \ln \hat{\Pr}(x_i) \tag{7.70}$$

where $\mathcal{L}(x_i)$ is defined as the average log-likelihood of the random variable of value
$\ln \hat{\Pr}(x_i)$. Clearly, the larger the average log-likelihood, the smaller the KL implying
a better model.
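The relation between the KL information of Eq. 7.69 and the average log-likelihood of Eq. 7.70 can be checked numerically. In this sketch the three-point “true” distribution and the two candidate estimates are made-up numbers chosen purely for illustration.

```python
import math


def kl_information(p_true, p_est):
    """I_KL = sum_i Pr(x_i) ln( Pr(x_i) / Pr_hat(x_i) ), Eq. 7.69."""
    return sum(p * math.log(p / q) for p, q in zip(p_true, p_est) if p > 0.0)


def average_log_likelihood(p_true, p_est):
    """L = sum_i Pr(x_i) ln Pr_hat(x_i), Eq. 7.70."""
    return sum(p * math.log(q) for p, q in zip(p_true, p_est) if p > 0.0)


p_true = [0.5, 0.3, 0.2]            # illustrative "true" distribution
rough = [1.0 / 3.0] * 3             # crude estimate
close = [0.45, 0.35, 0.20]          # better estimate

kl_rough = kl_information(p_true, rough)
kl_close = kl_information(p_true, close)
ll_rough = average_log_likelihood(p_true, rough)
ll_close = average_log_likelihood(p_true, close)
```

The estimate with the larger average log-likelihood is also the one with the smaller KL, since the first term of Eq. 7.69 does not depend on the estimate.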

The third property, entropy, is approximately equal to $\frac{1}{N}$ times the logarithm of the
probability that the relative frequency distribution of $N$ measurements obtained from the
estimated posterior equals the true distribution.

The $\mathcal{I}_{KL}$ is applied frequently to parameter estimation/system identification problems
to estimate the intrinsic order of the unknown system [5]. Two popular
information metrics have evolved from this theory: the Akaike Information Criterion
(AIC) and the minimum description length (MDL) [59, 60, 62, 63]. Both
are used to perform order estimation and are closely related as shown below

$$AIC(\eta) = -\ln R_{\epsilon\epsilon} + 2\frac{\eta}{N}$$

$$MDL(\eta) = -\ln R_{\epsilon\epsilon} + \frac{\eta}{2} \ln N$$

where $\eta$ is the system order, $\epsilon$ is the one-step prediction error with corresponding
covariance $R_{\epsilon\epsilon}$ and $N$ is the number of samples (data) values.

However, our interest lies in comparing two probability distributions to determine
“how close” they are to one another. Even though $\mathcal{I}_{KL}$ does quantify the difference
between the true and estimated distributions, unfortunately it is not a distance measure
due to its lack of symmetry. However, the Kullback divergence (KD) defined by
a combination of $\mathcal{I}_{KL}$

$$\mathcal{J}_{KD}(\Pr(x_i); \hat{\Pr}(x_i)) = \mathcal{I}_{KL}(\Pr(x_i); \hat{\Pr}(x_i)) + \mathcal{I}_{KL}(\hat{\Pr}(x_i); \Pr(x_i)) \tag{7.71}$$


is a distance measure between distributions indicating “how far” one is from the other. 
Consider the following example of this calculation. 

Example 7.2 

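We would like to calculate the KD for two Gaussian distributions, $p_i(x) \sim \mathcal{N}(m_i, V_i)$;
$i = 1, 2$ to establish a closeness measure. First, we calculate the KL information,

$$\mathcal{I}_{KL}(p_1(x); p_2(x)) = E_{p_1}\left\{\ln \frac{p_1(x)}{p_2(x)}\right\} = E_{p_1}\left\{\frac{1}{2}\ln\frac{V_2}{V_1} + \frac{(x - m_2)^2}{2V_2} - \frac{(x - m_1)^2}{2V_1}\right\}$$

Now performing the expectation term-by-term gives

$$\mathcal{I}_{KL}(p_1(x); p_2(x)) = \frac{1}{2}\ln\frac{V_2}{V_1} + \frac{1}{2V_2}\int (x - m_2)^2 p_1(x)\,dx - \frac{1}{2V_1}\int (x - m_1)^2 p_1(x)\,dx$$

Since the integral in the last term is the variance, $V_1$, that term is simply $-\frac{1}{2}$ and
therefore all we need do is expand and perform the integration of the second integral
term-by-term to give

$$\mathcal{I}_{KL}(p_1(x); p_2(x)) = \frac{1}{2}\ln\frac{V_2}{V_1} + \frac{1}{2V_2}\left[\int x^2 p_1(x)\,dx - 2m_2\int x\,p_1(x)\,dx + m_2^2\int p_1(x)\,dx\right] - \frac{1}{2}$$

or identifying terms from the properties of moments, we obtain

$$\mathcal{I}_{KL}(p_1(x); p_2(x)) = \frac{1}{2}\ln\frac{V_2}{V_1} + \frac{1}{2V_2}\left[(V_1 + m_1^2) - 2m_2 m_1 + m_2^2\right] - \frac{1}{2}$$

Finally, we have

$$\mathcal{I}_{KL}(p_1(x); p_2(x)) = \frac{1}{2}\ln\frac{V_2}{V_1} + \frac{V_1 + (m_1 - m_2)^2}{2V_2} - \frac{1}{2}$$

Performing the same calculation, we get

$$\mathcal{I}_{KL}(p_2(x); p_1(x)) = \frac{1}{2}\ln\frac{V_1}{V_2} + \frac{V_2 + (m_1 - m_2)^2}{2V_1} - \frac{1}{2}$$

and therefore the KD is

$$\mathcal{J}_{KD}(p_1(x); p_2(x)) = \mathcal{I}_{KL}(p_1(x); p_2(x)) + \mathcal{I}_{KL}(p_2(x); p_1(x)) = \frac{V_1^2 + (m_1 - m_2)^2(V_1 + V_2) + V_2^2}{2V_1V_2} - 1$$

This completes the example. △△△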
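The closed-form results of Example 7.2 can be checked numerically. The sketch below codes the KL and KD expressions for two scalar Gaussians and compares the KL against a trapezoidal integration of $p_1 \ln(p_1/p_2)$; the grid size and integration span are arbitrary choices made for this check.

```python
import math


def kl_gauss(m1, v1, m2, v2):
    """Closed-form I_KL(p1; p2) for scalar Gaussians N(m1, v1) and N(m2, v2)."""
    return 0.5 * math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / (2.0 * v2) - 0.5


def kd_gauss(m1, v1, m2, v2):
    """Closed-form Kullback divergence J_KD(p1; p2) from Example 7.2."""
    return (v1 ** 2 + (m1 - m2) ** 2 * (v1 + v2) + v2 ** 2) / (2.0 * v1 * v2) - 1.0


def log_pdf(x, m, v):
    """Log of the Gaussian density N(m, v) at x."""
    return -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)


def kl_numeric(m1, v1, m2, v2, n=20001, span=12.0):
    """Trapezoidal estimate of the integral of p1(x) ln(p1(x) / p2(x))."""
    lo = m1 - span * math.sqrt(v1)
    hi = m1 + span * math.sqrt(v1)
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        f = math.exp(log_pdf(x, m1, v1)) * (log_pdf(x, m1, v1) - log_pdf(x, m2, v2))
        total += f if 0 < k < n - 1 else 0.5 * f
    return total * h


kl_12 = kl_gauss(0.0, 1.0, 1.0, 2.0)
kl_21 = kl_gauss(1.0, 2.0, 0.0, 1.0)
kd = kd_gauss(0.0, 1.0, 1.0, 2.0)
```

The two KL values differ, which illustrates the lack of symmetry, while their sum reproduces the symmetric KD.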

The KL and therefore the KD can also be determined by probability distribu¬ 
tion estimation using MC sampling techniques [64, 65]. As another example of this 
approach consider how to apply the KL information to distinguish between a unimodal 
Gaussian and a Gaussian mixture. 

Example 7.3 

Suppose we would like to test whether a given data set is from a Gaussian distribution
specified by $\mathcal{N}(m, V)$ or from a Gaussian mixture distribution specified by
$p\,\mathcal{N}(m_1, V_1) + (1 - p)\,\mathcal{N}(m_2, V_2)$ where $p$ is the mixing coefficient (probability). The
Kullback divergence can easily be calculated and compared to a bound $\alpha$ to determine
“how close” the data is to a mixture or a Gaussian, that is,

$$\mathcal{J}_{KD}(\Pr(x_i); \hat{\Pr}(x_i)) = \mathcal{J}_{KD}\left(p\,\mathcal{N}(m_1, V_1) + (1 - p)\,\mathcal{N}(m_2, V_2);\ \mathcal{N}(m, V)\right) < \alpha$$

In order to perform the test we choose $m$ and $\sqrt{V}$ to minimize the KD above.
Solving we obtain

$$m = p\,m_1 + (1 - p)\,m_2; \quad \text{and} \quad V = p\,V_1 + (1 - p)\,V_2 + p(1 - p)(m_1 - m_2)^2$$

A typical bound of $\alpha = 0.1$ appears to perform well in distinguishing a single Gaussian
from a mixture. △△△
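The test of Example 7.3 can be sketched numerically as follows. The function `match_moments` codes the expressions for $m$ and $V$ given above, and the KD is approximated by a simple Riemann sum of $(p - q)\ln(p/q)$; the grid size, the span and the example mixtures are assumptions made for illustration.

```python
import math


def match_moments(p, m1, v1, m2, v2):
    """Mean and variance of the single Gaussian matched to the two-term mixture."""
    m = p * m1 + (1.0 - p) * m2
    v = p * v1 + (1.0 - p) * v2 + p * (1.0 - p) * (m1 - m2) ** 2
    return m, v


def pdf(x, m, v):
    """Gaussian density N(m, v) at x."""
    return math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)


def kd_mixture_vs_gaussian(p, m1, v1, m2, v2, n=20001, span=10.0):
    """Riemann-sum estimate of J_KD = integral of (mix - single) ln(mix / single)."""
    m, v = match_moments(p, m1, v1, m2, v2)
    lo = m - span * math.sqrt(v)
    hi = m + span * math.sqrt(v)
    h = (hi - lo) / (n - 1)
    total = 0.0
    for k in range(n):
        x = lo + k * h
        mix = p * pdf(x, m1, v1) + (1.0 - p) * pdf(x, m2, v2)
        single = pdf(x, m, v)
        total += (mix - single) * math.log(mix / single) * h
    return total


alpha = 0.1                                                  # bound quoted in the example
kd_near = kd_mixture_vs_gaussian(0.5, -1.0, 1.0, 1.0, 1.0)   # overlapping components
kd_far = kd_mixture_vs_gaussian(0.5, -3.0, 1.0, 3.0, 1.0)    # well separated components
looks_gaussian = kd_near < alpha
```

The divergence grows as the mixture components separate, which is what allows a fixed bound to discriminate between the two hypotheses.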

This completes the section. Next we discuss an entirely different approach to this
distribution problem by investigating “goodness of fit” testing.

7.6.2 Model Validation Testing 

One of the major practical concerns with all model-based processors is whether or not the
model embedded in the processor “matches” the underlying phenomena and can be
used to extract meaningful information from the noisy measurements. As mentioned,
in the classical Gaussian-based techniques, the zero-mean/whiteness testing of the
innovations is a critical measure of this match. These properties are also used extensively
for linearized models evolving from nonlinear dynamic systems as well [5]. In
all of these cases the distributions are considered unimodal and typically Gaussian.

In the non-unimodal (non-Gaussian) case the diagnostics are more complicated. 
The roots of MC model diagnostics lie in the basic Uniform Transformation Theorem 
of Sec. 3.3 and the works of Rosenblatt [48] and Smith [49] under the general area 
of “goodness-of-fit” statistical tests [39, 40, 50, 51, 57, 58]. 

The fundamental idea is based on analyzing the predicted measurement cumulative
distribution, $P_Y(y(t)|Y_{t-1})$. A processor is considered consistent only if the
measurement $y(t)$ is governed by the statistics of its predicted cumulative distribution.
Therefore, validation consists of statistically testing that the measurements
“match” the predictions using the underlying model embedded in the processor. By
defining the residual sequence as

$$\epsilon(t) := P_Y(y(t)|Y_{t-1}) = \Pr(Y(t) \le y(t)|Y_{t-1}) = \int_{Y \le y} \Pr(y'(t)|Y_{t-1})\, dy'(t) \tag{7.72}$$

we must show that $\{\epsilon(t)\}$ is a valid realization of an independent, identically
distributed process uniformly distributed on the interval [0,1] given the measurements
$Y_{t-1}$. Thus, the statistical test validates whether or not the sequence is $\epsilon(t) \sim \mathcal{U}(0,1)$
(component-wise for the vector case).
More formally, we use Rosenblatt’s theorem³ that states, if $y(t)$ is a continuous
random vector and the underlying model is “valid”, then the corresponding sequence
$\{\epsilon(t)\}$ is i.i.d. on [0,1]. Under the assumption that the residual sequence is standard
uniform, we can transform to obtain an equivalent Gaussian sequence [48]
such that

$$v(t) = \Phi^{-1}(\epsilon(t)) \quad \text{for } v \sim \mathcal{N}(0,1) \text{ with } \epsilon \sim \mathcal{U}(0,1) \tag{7.73}$$

where $\Phi^{-1}$ is the inverse standard Gaussian cumulative distribution. Once the residual
sequence is transformed, then all of the classical Gaussian statistical tests can be
performed to ensure validity of the underlying model.
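The transformation of Eq. 7.73 is easy to exercise when the predictive distribution is available in closed form. The sketch below assumes, purely for illustration, a scalar Gaussian predictive distribution with hypothetical mean `pred_mean` and standard deviation `pred_std`; it uses `statistics.NormalDist` from the Python standard library for $\Phi$ and $\Phi^{-1}$.

```python
from statistics import NormalDist


def uniform_residual(y, pred_mean, pred_std):
    """epsilon(t): predictive CDF evaluated at the measurement (Gaussian assumed)."""
    return NormalDist(pred_mean, pred_std).cdf(y)


def gaussian_residual(eps, tiny=1e-12):
    """v(t) = Phi^{-1}(epsilon(t)); eps is clipped so the inverse CDF stays defined."""
    eps = min(max(eps, tiny), 1.0 - tiny)
    return NormalDist().inv_cdf(eps)


eps = uniform_residual(1.3, pred_mean=1.0, pred_std=0.5)
v = gaussian_residual(eps)
```

For a Gaussian predictive distribution the transformed residual reduces to the familiar normalized innovation, $(y - \text{mean})/\text{std}$, which ties this test back to the classical whiteness testing.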

With this in mind we are still required to solve two basic problems: (1) the estimation
of the residual sequence $\epsilon(t)$ or equivalently the estimation of the predictive
measurement cumulative distribution, $P_Y(y(t)|Y_{t-1})$; and (2) the diagnostic statistical
testing of $\epsilon(t)$ or equivalently $v(t)$, that is, demonstrating that $\epsilon \sim \mathcal{U}(0,1)$ or
$v \sim \mathcal{N}(0,1)$.

The key to estimating the residual sequence is based on representing the predictive
cumulative distribution as an infinite mixture [39, 40, 51]

$$\epsilon(t) = P_Y(y(t)|Y_{t-1}) = \int P_Y(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1})\, dx(t) \tag{7.74}$$

An MC approach to this estimation problem is to “particulate” the required underlying
distribution and estimate the empirical prediction cumulative distribution or
equivalently the residuals. Perhaps the simplest MC technique is to sample from
the predicted state distribution directly “when possible”, that is, if we could replace
the prediction distribution with its empirical representation, then we could obtain an
estimated residual [57], that is,

$$\hat{\epsilon}(t) = \int P_Y(y(t)|x(t)) \times \left[\frac{1}{N_p}\sum_{i=1}^{N_p} \delta(x(t) - x_i(t))\right] dx(t) \tag{7.75}$$

or simply

$$\hat{\epsilon}(t) = \frac{1}{N_p}\sum_{i=1}^{N_p} P_Y(y(t)|x_i(t)) \tag{7.76}$$

³ The theorem states that for a given random vector $y \in \mathcal{R}^{N_y \times 1}$ with corresponding distribution $P_Y(y)$,
the transformed vector, $\epsilon = Ty$, is uniformly distributed on the $N_y$-hypercube for $\Pr(\epsilon) = \prod_{i=1}^{N_y} \epsilon_i$ when
$0 \le \epsilon_i \le 1$. The transformation, $T$, is given by $\epsilon_i = P_Y(y_i|Y_{i-1})$; $i = 1, \ldots, N_y$ [48].

However, if direct sampling of the predicted state distribution is not possible,
then the predictive decomposition into the transition prior and posterior can be
accomplished, that is,

$$\Pr(x(t)|Y_{t-1}) = \int \Pr(x(t)|x(t-1)) \times \Pr(x(t-1)|Y_{t-1})\, dx(t-1) \tag{7.77}$$

or using the state-space representation of the transition prior

$$\Pr(x(t)|Y_{t-1}) = \int \mathcal{A}(x(t)|x(t-1)) \times \Pr(x(t-1)|Y_{t-1})\, dx(t-1) \tag{7.78}$$

and performing a particle approximation, we obtain the empirical estimate of the
$(t-1)$-step posterior as

$$\hat{\Pr}(x(t-1)|Y_{t-1}) = \sum_{k=1}^{N_p} \mathcal{W}_k(t-1)\, \delta(x(t-1) - x_k(t-1)) \tag{7.79}$$

Substituting this expression into Eq. 7.78 gives the prediction distribution

$$\hat{\Pr}(x(t)|Y_{t-1}) = \int \mathcal{A}(x(t)|x(t-1)) \times \left[\sum_{k=1}^{N_p} \mathcal{W}_k(t-1)\, \delta(x(t-1) - x_k(t-1))\right] dx(t-1) = \sum_{k=1}^{N_p} \mathcal{W}_k(t-1)\, \mathcal{A}(x(t)|x_k(t-1)) \tag{7.80}$$

and the residual of Eq. 7.74 becomes

$$\hat{\epsilon}(t) = \int P_Y(y(t)|x(t)) \left[\sum_{k=1}^{N_p} \mathcal{W}_k(t-1)\, \mathcal{A}(x(t)|x_k(t-1))\right] dx(t) \tag{7.81}$$


If we draw a sample, $x_k(t)$, from the transition prior, $x_k \sim \mathcal{A}(x(t)|x_k(t-1))$, and use
the perfect sample approximation,

$$\mathcal{A}(x(t)|x_k(t-1)) \approx \delta(x(t) - x_k(t)) \tag{7.82}$$

then substituting into Eq. 7.81 for the transition distribution yields

$$\hat{\epsilon}(t) = \sum_{k=1}^{N_p} \mathcal{W}_k(t-1) \int P_Y(y(t)|x(t)) \times \delta(x(t) - x_k(t))\, dx(t) \tag{7.83}$$

giving the final expression for the estimated residual as [39, 40, 51]

$$\hat{\epsilon}(t) = \sum_{k=1}^{N_p} \mathcal{W}_k(t-1)\, P_Y(y(t)|x_k(t)) \tag{7.84}$$

for

$$P_Y(y(t)|x_k(t)) = \int_{Y \le y} \mathcal{C}(y'(t)|x_k(t))\, dy'(t)$$

which can be estimated through direct integration or using an empirical CDF [14], if
necessary.
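A minimal sketch of the estimated residual of Eq. 7.84 is given below for a scalar model with additive Gaussian noise, so that the measurement CDF is available in closed form. The model functions `a` and `c`, the noise standard deviations `q_std` and `r_std`, and the particle set are hypothetical and serve only to exercise the formula.

```python
import random
from statistics import NormalDist


def estimated_residual(y, prev_particles, prev_weights, a, c, q_std, r_std, rng):
    """eps_hat(t) = sum_k W_k(t-1) P_Y(y(t) | x_k(t)), x_k(t) drawn from the prior."""
    total = 0.0
    for x_prev, w in zip(prev_particles, prev_weights):
        x_k = a(x_prev) + rng.gauss(0.0, q_std)           # sample the transition prior
        total += w * NormalDist(c(x_k), r_std).cdf(y)     # Gaussian measurement CDF
    return total


def a(x):
    """Hypothetical state transition function."""
    return 0.9 * x


def c(x):
    """Hypothetical measurement function."""
    return x


seed_rng = random.Random(4)
prev_particles = [seed_rng.gauss(1.0, 0.5) for _ in range(100)]
prev_weights = [1.0 / 100] * 100

eps_hat = estimated_residual(0.8, prev_particles, prev_weights, a, c,
                             q_std=0.3, r_std=0.2, rng=random.Random(5))
```

Applying this at every time-step produces the residual sequence that is then tested for uniformity.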

Once we have obtained the estimate of the residual $\hat{\epsilon}(t)$ (or equivalently the transformed
residual $v(t)$), we can perform diagnostic tests to evaluate the “goodness-of-fit”
and therefore evaluate the validity of the embedded dynamic model.

There are a variety of diagnostic tests that can be performed which include: $\chi^2$-testing
(C-Sq), Kolmogorov-Smirnov (K-S) tests, normality testing (graphically),
zero-mean/whiteness testing, etc. Here we concentrate on the C-Sq and K-S as well as
moment testing of the transformed residuals. Zero-mean/whiteness testing was
already discussed in Sec. 5.6.

Any of the Kalman techniques can also be used to generate an approximation to the 
sequence of residuals or prediction cumulative distribution, using the empirical PDF, 
EKF, UKF, Gauss-Hermite ( G-H ) grid-based integration as well as the Gaussian 
mixture approach, that is, Gaussian sums ( G-S ) (see [51, 66-69] for more details). 

7.6.2.1 Chi-Square Model Validation Test Chi-square (C-Sq) tests are
hypothesis tests with the null hypothesis, $H_0$, that an $N$-length sequence of data
is a random sample from a specified distribution against the alternative that it is not
[14, 15]. It is usually applied to Gaussian distributions. C-Sq tests are based on the
fact that the exponent of a Gaussian distribution, that is, the square of the ratio of
the random variable minus its mean divided by its standard deviation, summed over $N$
independent samples, is chi-square distributed with $N$ degrees of freedom, that is,
$\sum_{i=1}^{N} ((x_i - \mu)/\sigma)^2 \sim \chi^2(N)$. However, this test can be applied to any distribution.

For instance, if the random variable is binomially distributed with $\mathcal{B}(N, p)$ for $p$
the success probability, then it takes the same form as the exponent above; it is
distributed (in the limit) as $\chi^2(1)$. Extending this to a multinomial distribution with
parameters $N$ and $p(i)$ for $i = 1, \ldots, k-1$; $\mathcal{M}(N, \{p(i)\})$, then in the limit as $N \rightarrow \infty$
the test statistic, say $\mathcal{C}_{k-1}$, has an approximate $\chi^2(k-1)$ distribution.⁴

Hypothesis testing that $H_0$ is true using the test statistic is specified by the value
$\kappa$ (probability bound) of the $\chi^2(k-1)$ using $\Pr(\mathcal{C}_{k-1} > \kappa) = \alpha$ for $\alpha$ the significance
level of the test. Thus, if the test statistic is less than $\kappa$ the null hypothesis is accepted
and the selected distribution is taken to be correct.


⁴ To be more specific, if the i.i.d. random variables, $y_1, \ldots, y_{k-1}$ are multinomially distributed
with $y_k = N - \sum_{i=1}^{k-1} y_i$ and $p(k) = 1 - \sum_{i=1}^{k-1} p(i)$, then the statistic, $\mathcal{C}_{k-1} = \sum_{i=1}^{k} (y_i - Np(i))^2 / Np(i)$ is
$\mathcal{C}_{k-1} \sim \chi^2(k-1)$ [15].
For our problem, a goodness-of-fit test of the residuals follows directly from their
uniformity property under an adequate model. The most common statistical test for
uniformity follows from the $\chi^2$-test based on segmenting the estimated residual sequence
on [0,1] into subintervals and testing. The chi-square statistical test can be used
to decide whether or not the residual sequence $\epsilon(t)$ is $\mathcal{U}(0,1)$ or equivalently the
transformed residual, $v(t)$, is $\mathcal{N}(0,1)$.

The C-Sq test statistic for our problem is given by

$$\mathcal{C}_{N_\epsilon - 1} = \sum_{i=1}^{N_\epsilon} \frac{(n_\epsilon(i) - \epsilon)^2}{\epsilon} \tag{7.85}$$

where $N$ is the total number of residual samples; $N_\epsilon$ is the number of bins (equally
spaced subintervals); $n_\epsilon(i)$ is the number of residual counts in the $i$th bin (subinterval);
and $\epsilon$ is the expected number of counts per bin given by $\epsilon = N/N_\epsilon$.

If the residual sequence is uniform, then $\mathcal{C}_{N_\epsilon - 1} \sim \chi^2(N_\epsilon - 1)$ and $\kappa$ is compatible
with a $\chi^2(N_\epsilon - 1)$ distribution at significance level, $\alpha$. Therefore, if $\mathcal{C}_{N_\epsilon - 1} < \kappa$ the
null hypothesis that the residual sequence is uniformly distributed is accepted and
the model is adequate (validated); otherwise it is rejected. Thus, the $\chi^2$-model
validation test is:


• Partition the $N$-sample residual sequence into $N_\epsilon$ bins (equally spaced
subintervals)

• Count the number of residual samples in each bin, $n_\epsilon(i)$; $i = 1, \ldots, N_\epsilon$

• Calculate the expected bin count, $\epsilon = N/N_\epsilon$

• Calculate the test statistic $\mathcal{C}_{N_\epsilon - 1}$ of Eq. 7.85

• Test that $\mathcal{C}_{N_\epsilon - 1} < \kappa$ [Accept $H_0$]


Example 7.4 

Suppose we have a residual sequence, $\epsilon(t)$, scaled on [0, 1] of $N = 1000$ samples and
we would like to test that it is uniformly distributed using the $\chi^2$-model validation
test. We partition the sequence into $N_\epsilon = 10$ bins; therefore, the expected counts per
bin is $\epsilon = 100$. At the $\alpha = 5\%$ significance level, the test statistic,

$$\mathcal{C}_{N_\epsilon - 1} = 3.22$$

is less than $\kappa$ (probability bound) accepting the null hypothesis that the sequence is
uniform and therefore the model is validated. △△△
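The steps of the $\chi^2$-model validation test can be sketched as follows. The bound `kappa` must be supplied by the user; the value 16.919 used below is the standard tabulated 95% point of the $\chi^2$ distribution with 9 degrees of freedom, which matches the 10-bin, 5% setting of Example 7.4. The simulated residual sequence is, of course, only a stand-in for residuals produced by a processor.

```python
import random


def chi_square_statistic(residuals, n_bins):
    """C-Sq statistic of Eq. 7.85 for residuals scaled on [0, 1]."""
    n = len(residuals)
    counts = [0] * n_bins
    for e in residuals:
        counts[min(int(e * n_bins), n_bins - 1)] += 1     # bin index, 1.0 goes to last bin
    expected = n / n_bins                                 # expected counts per bin
    return sum((cnt - expected) ** 2 / expected for cnt in counts)


def chi_square_model_test(residuals, n_bins, kappa):
    """Return True when the uniformity hypothesis H0 is accepted."""
    return chi_square_statistic(residuals, n_bins) < kappa


KAPPA_9_DOF_5_PERCENT = 16.919      # tabulated chi-square(9) 95% point

rng = random.Random(3)
residuals = [rng.random() for _ in range(1000)]           # simulated residuals
statistic = chi_square_statistic(residuals, 10)
accepted = chi_square_model_test(residuals, 10, KAPPA_9_DOF_5_PERCENT)
```

Since the test is run at the 5% level, a truly uniform sequence is still rejected about once in twenty trials, so a single rejection should be interpreted with care.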

Next we consider another more robust method for goodness-of-fit testing. 

7.6.2.2 Kolmogorov-Smirnov Model Validation Test The chi-square
goodness-of-fit test suffers from the limitations of arbitrary interval widths and the
requirement of large data sets. An alternative or complementary approach is the
Kolmogorov-Smirnov (K-S) goodness-of-fit test that is based on deciding whether or
not a hypothesized (e.g., uniform) or estimated cumulative distribution characterizes
a given data sequence (e.g., residual). The hypothesis test is given by:

$$H_0: \hat{P}_E(\epsilon) = P_0(\epsilon)$$

$$H_1: \hat{P}_E(\epsilon) \ne P_0(\epsilon) \tag{7.86}$$

where $\hat{P}_E$ is the underlying (estimated) population CDF and $P_0$ is the hypothesized
CDF. The test statistic used in making the decision is:

$$\mathcal{K} = \max_{\epsilon} |\hat{P}_E(\epsilon) - P_0(\epsilon)| \tag{7.87}$$


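where $\hat{P}_E$ is given by the empirical distribution function estimate

$$\hat{P}_E(\epsilon) = \frac{N_E(\epsilon)}{N} \rightarrow \Pr(E \le \epsilon) = P_E(\epsilon) \quad \text{as } N \rightarrow \infty \tag{7.88}$$

with $N_E(\epsilon)$ the number of samples not exceeding $\epsilon$. For large $N$, $\mathcal{K} \approx 0$ with $H_0$ true
while for $H_1$ true, $\mathcal{K}$ is close to the maximum
difference. Therefore, we reject $H_0$ if $\mathcal{K} > \kappa$ with $\kappa$ a constant determined by the
level-of-significance of the hypothesis test, $\alpha$. That is,

$$\alpha = \Pr(\mathcal{K} > \kappa\,|\,H_0) \approx 2e^{-2N\kappa^2} \tag{7.89}$$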


Thus the K-S test is:

• Estimate the empirical CDF, $\hat{P}_E(\epsilon)$

• Calculate K-S test statistic $\mathcal{K}$ from Eq. 7.87

• Test that $\mathcal{K} < \kappa$ [Accept $H_0$]

Example 7.5 

We have a residual sequence, $\epsilon(t)$, scaled on [0, 1] of $N = 25$ samples and we would
like to perform the K-S test that the samples are uniformly distributed at the $\alpha = 5\%$
significance level. The test statistic,

$$\mathcal{K} = 0.17$$

is less than $\kappa = 0.26$ accepting the null hypothesis that the sequence is uniform and
therefore the model is valid. The test is shown in Fig. 7.11 as the location of the
maximum deviation between the hypothesized and empirical distributions. △△△
FIGURE 7.11 Kolmogorov-Smirnov model validation test of residual sequence: hypothesized
and empirical CDF.
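The K-S test against a uniform hypothesis can be sketched as below. The statistic is evaluated at the jumps of the empirical CDF, and the bound $\kappa$ is obtained by solving the large-sample approximation of Eq. 7.89 for $\kappa$; for $N = 25$ and $\alpha = 0.05$ this gives roughly 0.27, close to the bound quoted in Example 7.5. The simulated residuals are illustrative only.

```python
import math
import random


def ks_statistic_uniform(residuals):
    """K = max |P_hat(eps) - eps| for a hypothesized U(0,1) CDF (Eq. 7.87)."""
    e = sorted(residuals)
    n = len(e)
    d_plus = max((i + 1) / n - x for i, x in enumerate(e))   # just after each jump
    d_minus = max(x - i / n for i, x in enumerate(e))        # just before each jump
    return max(d_plus, d_minus)


def ks_bound(n, alpha):
    """kappa solving alpha = 2 exp(-2 N kappa^2), the approximation of Eq. 7.89."""
    return math.sqrt(-math.log(alpha / 2.0) / (2.0 * n))


rng = random.Random(7)
residuals = [rng.random() for _ in range(25)]                # simulated residuals
k_stat = ks_statistic_uniform(residuals)
kappa = ks_bound(25, 0.05)
accept_h0 = k_stat < kappa
```

Unlike the chi-square test, no binning is required, which is why the K-S test remains usable for short residual records.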

When the transformed residuals are used, then the standard zero-mean/whiteness 
testing can be accomplished as well as estimating the moments of the Gaussian 
distribution which we discuss in the next section. 

7.6.2.3 Moment-Based Model Validation Test When the residuals are
transformed from the hypothesized uniformly distributed sequence to a standard
Gaussian, $v \sim \mathcal{N}(0,1)$, then a wealth of normality diagnostics can be applied to validate
the adequacy of the embedded model. Besides the zero-mean/whiteness and
WSSR tests of Sec. 5.6 for Gaussian processes, the usual suite of diagnostics can be
applied to estimate the underlying moments to check for consistency of the Gaussian
assumption and therefore model validation. The brute force approach is simply to calculate
the mean-squared (estimation) error over an ensemble of runs of the processor
and any truth model available for comparison,



$$MSE = E\{(\Theta - \hat{\Theta})^2\} \tag{7.90}$$

where $\Theta$ can be any parameter, state or measurement estimated by $\hat{\Theta}$ and the expectation
can be calculated by integrating or solving over an ensemble generated by
executing the processor a multitude of times and averaging. This can be very costly
and sometimes impossible because of the lack of the “truth”.

Another approach is to calculate a set of statistical indexes that can be used to
“qualitatively” assess performance and model validity using the transformed residual
sequence, $v(t)$ [51-53]. Here the first four central moments of an $N$-sample sequence
of transformed residuals are estimated based on

$$m_v(k) = \frac{1}{N}\sum_{t=1}^{N} (v(t) - m_v(1))^k \quad \text{for } k \ge 2 \tag{7.91}$$

with the first moment the sample mean

$$m_v(1) = \frac{1}{N}\sum_{t=1}^{N} v(t) \tag{7.92}$$

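These moments can then be used to construct the following diagnostic indices
which are asymptotically distributed $\mathcal{N}(0,1)$ (see [51] for details).

• Bias Index: $B_N = \sqrt{N}\, m_\nu(1)$

• Dispersion Index: $\mathcal{D}_N = \dfrac{N m_\nu(2) - (N-1)}{\sqrt{2(N-1)}}$

• Skewness Index: $S_N = \sqrt{\dfrac{(N+1)(N+3)}{6(N-2)}} \times \dfrac{m_\nu(3)}{\sqrt{m_\nu^3(2)}}$

• Tail Index: $T_N = \dfrac{(N+1)\sqrt{(N+3)(N+5)}}{\sqrt{24N(N-2)(N-3)}} \times \left(\dfrac{m_\nu(4)}{m_\nu^2(2)} - \dfrac{3(N-1)}{N+1}\right)$

• Joint Index: $J_N = S_N^2 + T_N^2$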


From a pragmatic perspective these indices are used in a more qualitative manner
even though they are quantitative. They are used to expose “surprising values”, that
is, for $N$ not too small, the indices can be bound by some constant, $\beta$, and
compared with upper and lower quantile estimates of their exact distribution [51]. For
instance, consider the quantile, $\mathcal{D}_{N,\alpha}$, of the exact dispersion index distribution under
the assumption of a valid model. Then it can be shown [51] that

$$\mathcal{D}_{N,\alpha} = \frac{\chi^2_{N-1,\alpha} - (N-1)}{\sqrt{2(N-1)}} \qquad (7.93)$$

where $\chi^2_{N-1,\alpha}$ is the $\alpha$-quantile of the $\chi^2(N-1)$ distribution. Other measures such as the
skewness and tail indices, $S_N$ and $T_N$, can be obtained from MC simulations [51].

If $B_N$ is surprisingly high or low, the measurements tend to be larger or smaller than
predicted ($\hat{y}(t)$), while a surprisingly high or low dispersion index, $\mathcal{D}_N$, indicates that
the measurements are over or under dispersed. The $S_N$ and $T_N$ are useful in analyzing
the measurement distribution. A higher or lower $S_N$ indicates a skew to either right or
left while a higher or lower $T_N$ indicates longer or shorter tails respectively. The $J_N$
is asymptotically equivalent to normality tests [50, 54]. A suite of other statistics
exists for testing correlations [55, 56] as well.
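
As a rough illustration of how these indices might be evaluated, the following Python sketch computes the sample moments of Eqs. 7.91 and 7.92 and the five indices from a transformed residual sequence. The function name `diagnostic_indices` is hypothetical, and the skewness and tail normalizations used here are the standard exact-moment normalizations for Gaussian samples, which this sketch assumes are the ones intended by [51].

```python
import numpy as np

def diagnostic_indices(v):
    """Bias, dispersion, skewness, tail and joint indices of a residual sequence.

    v: 1-D sequence of transformed residuals, assumed N(0,1) under a valid model.
    Requires N > 3 so that the tail-index normalization is defined.
    """
    v = np.asarray(v, dtype=float)
    N = v.size
    m1 = v.mean()                                    # sample mean, Eq. 7.92
    c = v - m1
    m2, m3, m4 = (np.mean(c ** k) for k in (2, 3, 4))  # central moments, Eq. 7.91

    B = np.sqrt(N) * m1
    D = (N * m2 - (N - 1)) / np.sqrt(2.0 * (N - 1))
    # Assumed normalizations: exact variance of sample skewness and
    # exact mean/variance of sample kurtosis for Gaussian data.
    S = np.sqrt((N + 1) * (N + 3) / (6.0 * (N - 2))) * m3 / m2 ** 1.5
    T = ((N + 1) * np.sqrt((N + 3) * (N + 5) / (24.0 * N * (N - 2) * (N - 3)))
         * (m4 / m2 ** 2 - 3.0 * (N - 1) / (N + 1)))
    return {"B": B, "D": D, "S": S, "T": T, "J": S ** 2 + T ** 2}
```

In practice each returned value would be compared against a bound or against quantiles such as that of Eq. 7.93, rather than read as a formal test on its own.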

This completes the section on practical aspects of PF design and analysis. We will 
couple these statistical tests to the classical whiteness testing techniques to evaluate 
the performance of the processors. Next let us consider the design of a “bootstrap” 
processor on a canonical problem: a case study for population growth.

7.7 CASE STUDY: POPULATION GROWTH PROBLEM

In this section we discuss the development of state-space particle filters (SSPF) for the
population growth problem. We consider this well-known problem that has become
a benchmark for many of the PF algorithms. It is highly nonlinear and nonstationary.
Thus, consider the problem of [25] and [20, 26-29].

The state transition and corresponding measurement model are given by

$$x(t) = \frac{1}{2}x(t-1) + \frac{25x(t-1)}{1 + x^2(t-1)} + 8\cos(1.2(t-1)) + w(t-1)$$

$$y(t) = \frac{x^2(t)}{20} + v(t)$$

where $\Delta t = 1.0$, $w \sim \mathcal{N}(0,10)$ and $v \sim \mathcal{N}(0,1)$. The initial state is Gaussian
distributed with $x(0) \sim \mathcal{N}(0.1,5)$.

In terms of the nonlinear state-space representation, we have

$$a[x(t-1)] = \frac{1}{2}x(t-1) + \frac{25x(t-1)}{1 + x^2(t-1)}$$

$$b[u(t-1)] = 8\cos(1.2(t-1))$$

$$c[x(t)] = \frac{x^2(t)}{20}$$

In the Bayesian framework, we would like to estimate the instantaneous posterior
filtering distribution,

$$\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\,\delta(x(t) - x_i(t)) \qquad (7.94)$$


where the unnormalized importance weight is given by 

$$W_i(t) = \frac{\Pr(x_i(t)|Y_t)}{q(x_i(t)|Y_t)} \qquad (7.95)$$

The weight recursion for the bootstrap case is $W_i(t) = W_i(t-1) \times \mathcal{C}(y(t)|x(t))$.
Therefore, for the Bayesian processor, we have that the state transition probability is
given by

$$\mathcal{A}(x(t)|x(t-1)) \sim \mathcal{N}(x(t) : a[x(t-1)], R_{ww}) \qquad (7.96)$$

Thus, the SIR algorithm becomes:

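1. Draw samples (particles) from the state transition distribution:

   $$x_i(t) \rightarrow \mathcal{N}(x(t) : a[x(t-1)], R_{ww})$$

   $$w_i(t) \rightarrow \Pr(w(t)) \sim \mathcal{N}(0, R_{ww})$$

   $$x_i(t) = \frac{1}{2}x_i(t-1) + \frac{25x_i(t-1)}{1 + x_i^2(t-1)} + 8\cos(1.2(t-1)) + w_i(t-1)$$

2. Estimate the weight/likelihood:

   $$W_i(t) = \mathcal{C}(y(t)|x_i(t)) \rightarrow \mathcal{N}(y(t) : c[x_i(t)], R_{vv}(t))$$

   $$c[x_i(t)] = \frac{x_i^2(t)}{20}$$

3. Normalize the weight: $\mathcal{W}_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$

4. Resample: $\hat{x}_i(t) \Rightarrow x_i(t)$

5. Estimate the instantaneous posterior:

   $$\hat{\Pr}(x(t)|Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\,\delta(x(t) - x_i(t))$$

6. Estimate (inference) the corresponding statistics:

   $$\hat{X}_{MAP}(t) = \arg\max \hat{\Pr}(x(t)|Y_t)$$

   $$\hat{X}_{MMSE}(t) = E\{x(t)|Y_t\} = \sum_{i=1}^{N_p} x_i(t)\,\hat{\Pr}(x(t)|Y_t)$$

   $$\hat{X}_{median}(t) = \text{median}(\hat{\Pr}(x(t)|Y_t))$$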
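
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the processor used to generate the figures: the second argument of each Gaussian is taken to be a variance, multinomial resampling is applied at every step, weights are evaluated in the log domain for numerical safety, and the names `simulate` and `bootstrap_pf` are hypothetical.

```python
import numpy as np

def a(x):
    """State transition nonlinearity a[x(t-1)]."""
    return 0.5 * x + 25.0 * x / (1.0 + x ** 2)

def b(t):
    """Deterministic input term b[u(t-1)] = 8 cos(1.2 (t-1))."""
    return 8.0 * np.cos(1.2 * (t - 1))

def c(x):
    """Measurement nonlinearity c[x(t)] = x^2(t) / 20."""
    return x ** 2 / 20.0

def simulate(n_steps, r_ww=10.0, r_vv=1.0, rng=None):
    """Simulate the hidden state and noisy measurement (variances assumed)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.empty(n_steps + 1)
    y = np.full(n_steps + 1, np.nan)          # no measurement at t = 0
    x[0] = rng.normal(0.1, np.sqrt(5.0))
    for t in range(1, n_steps + 1):
        x[t] = a(x[t - 1]) + b(t) + rng.normal(0.0, np.sqrt(r_ww))
        y[t] = c(x[t]) + rng.normal(0.0, np.sqrt(r_vv))
    return x, y

def bootstrap_pf(y, n_particles=500, r_ww=10.0, r_vv=1.0, rng=None):
    """Bootstrap (SIR) particle filter returning the MMSE state estimate."""
    rng = np.random.default_rng() if rng is None else rng
    n_steps = len(y) - 1
    particles = rng.normal(0.1, np.sqrt(5.0), n_particles)
    x_mmse = np.empty(n_steps + 1)
    x_mmse[0] = particles.mean()
    for t in range(1, n_steps + 1):
        # Step 1: draw particles from the state transition distribution.
        noise = rng.normal(0.0, np.sqrt(r_ww), n_particles)
        particles = a(particles) + b(t) + noise
        # Step 2: Gaussian likelihood weights, computed as log-weights.
        log_w = -0.5 * (y[t] - c(particles)) ** 2 / r_vv
        # Step 3: normalize (subtracting the max avoids underflow).
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        # Step 6 (MMSE only): weighted particle mean before resampling.
        x_mmse[t] = np.sum(w * particles)
        # Step 4: multinomial resampling according to the weights.
        particles = particles[rng.choice(n_particles, size=n_particles, p=w)]
    return x_mmse
```

A call such as `x, y = simulate(100)` followed by `bootstrap_pf(y)` gives one realization. Because the measurement depends on $x^2(t)$, the posterior is frequently bimodal, so an MMSE estimate from a single run can sit between the modes.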

We show the simulated data in Fig. 7.12. In (a) we see the hidden state and in (b) the
noisy measurement. The estimated instantaneous posterior distribution surface for
the state is shown in Fig. 7.13a while slices at selected instants of time are shown in
Fig. 7.13b with the circles annotating particle locations normalized to a constant weight.
Here we see that the posterior is clearly not unimodal and in fact we can see its evolution
in time as suggested by Fig. 7.1 previously. The final state and measurement estimates
are shown in Fig. 7.12 demonstrating the effectiveness of the PF bootstrap processor
for this problem. Various ensemble estimates are shown (e.g., median, MMSE, MAP).
It is clear from the figure that the EKF gives a very poor MMSE estimate since the
posterior is not Gaussian (unimodal).

FIGURE 7.13 Nonlinear, nonstationary, non-Gaussian problem: (a) Instantaneous posterior
surface. (b) Time slices of the posterior (cross-section) at selected time-steps with particle
locations annotated by circles with constant amplitudes.

7.8 SUMMARY

In this chapter we have discussed the development of state-space particle filters
(SSPF). After introducing the idea of Bayesian particle filters, we showed how
the state-space models could easily be interpreted in terms of this framework. We
developed a generic state-space particle filtering algorithm based on the importance
(sampling) proposals selected either using the minimum variance or transition prior
approach. However, we emphasized that in practice these techniques suffer from particle
depletion and lack of diversity because of ever-increasing weight variances causing
divergence of the processors. We introduced the concept of resampling as a solution
to the divergence problem and discussed a number of techniques to mitigate it.
With that in hand, we discussed the popular bootstrap particle filter
and showed some examples to demonstrate its performance. We then proceeded to
discuss improvements to the bootstrap approach attempting to approximate the minimum
variance proposal. These methods included the auxiliary, regularized, MCMC
and linearized algorithms. Next we investigated some of the practical aspects of particle
filter design and developed a number of statistical tests to determine performance
including both information theoretic approaches to validate the posterior distribution
as well as diagnostic testing for model validation. We concluded the chapter with a
case study on population growth, a nonlinear/non-Gaussian model presenting a very
challenging problem for any particle filter design. Besides the references in the chapter
there has been a wealth of particle filtering papers appearing in both the statistics
and signal processing literature [30-43].


MATLAB NOTES 

MATLAB is a command-oriented vector-matrix package with a simple yet effective
command language featuring a wide variety of embedded C language constructs
making it ideal for signal processing applications and graphics. MATLAB has a
Statistics Toolbox that incorporates a large suite of PDFs and CDFs as well as
“inverse” CDF functions ideal for simulation-based algorithms. The mhsample
command incorporates the Metropolis, Metropolis-Hastings and Metropolis
independence samplers in a single command while the Gibbs sampling approach
is adequately represented by the more efficient slice sampler (slicesample). There
are even specific “tools” for sampling as well as the inverse CDF method captured
in the randsample command. PDF estimators include the usual histogram (hist)
as well as the sophisticated kernel density estimator (ksdensity) offering a variety
of kernel (window) functions (Gaussian, etc.) and ICDF methods including the
empirical cumulative distribution (ecdf) estimator. As yet no sequential algorithms
are available.

In terms of statistical testing for particle filtering diagnostics MATLAB offers
the chi-square “goodness-of-fit” test chi2gof as well as the Kolmogorov-Smirnov
distribution test kstest. Residuals can be tested for whiteness using the Durbin-Watson
test statistic dwtest while “normality” is easily checked using the
normplot command indicating the closeness of the test distribution to a Gaussian.
Other statistics are also evaluated using the mean, moment, skewness, std, var
and kurtosis commands. Type help stats in MATLAB to get more details or go to
the MathWorks website.

PROBLEMS

7.1 Given a sequence of Gaussian data (measurements) characterized by
$y \sim \mathcal{N}(\mu, \sigma^2)$, find the best estimate of the parameters defined by $\Theta := [\mu\ \sigma]'$
using a “sequential” MC approach. Show the mathematical steps in developing
the technique and construct a simple PF to solve the problem.

7.2 Consider the following simple model [7]

$$x(t) = a x(t-1) + w(t-1) \quad \text{for } w \sim \mathcal{N}(0, R_{ww}(i)) \text{ with } I(t) = i$$
$$y(t) = x(t) + v(t) \quad \text{for } v \sim \mathcal{N}(0, R_{vv})$$

with $\Pr(I(t) = i \,|\, I(t-1), x(t-1)) = \Pr(I(t) = i) = p_i$

(a) Suppose $i = \{1, 2\}$, what is the distribution, $\Pr(I(t) = (i_1, i_2), x_1 | y_1, x_0)$?

(b) How would the marginal be estimated using a Kalman filter?

(c) Develop a computational approach to the bootstrap PF algorithm for this
problem.

7.3 Suppose we have two multivariate Gaussian distributions for the parameter
vector, $\Theta \sim \mathcal{N}(\mu_i, \Sigma_i)$; $i = 1, 2$

(a) Calculate the Kullback-Leibler (KL) distance metric, $J$.

(b) Suppose $\Sigma = \Sigma_1 = \Sigma_2$, recalculate the KL for this case.

7.4 An aircraft flying over a region can use the terrain and an archival digital map 
to navigate. Measurements of the terrain elevation are collected in real time 
while the aircraft altitude over mean sea level is measured by a pressure meter 
with ground clearance measured by a radar altimeter [7]. The measurement 
differences are used to estimate the terrain elevations and compared to a digital 
elevation map to estimate aircraft position. The discrete navigation model is 
given by 


$$x(t) = x(t-1) + u(t-1) + w(t-1) \quad \text{for } w \sim \mathcal{N}(0, R_{ww})$$
$$y(t) = c[x(t)] + v(t) \quad \text{for } v \sim \mathcal{N}(0, R_{vv})$$

where $x$ is the 2D-position, $y$ is the terrain elevation measurement, the navigation
systems output $u$ and $w$ are the respective distance traveled and error
drift during one time interval. The nonlinear function $c[\cdot]$ denotes the terrain
database yielding terrain elevation outputs with $v$ the associated database
errors and measurements. Both noises are assumed zero-mean, Gaussian with
known statistics, while the initial state is also Gaussian, $x(0) \sim \mathcal{N}(\bar{x}(0), P(0))$.

(a) Based on this generic description construct the bootstrap PF algorithm
for this problem.

(b) Suppose: $P(0) = \text{diag}[10^4\ 10^4]'$, $R_{ww} = \text{diag}[25\ 25]'$, $R_{vv} = 16$, $N = 150$
samples, $u(t) = [25\ 25]'$. Simulate the aircraft measurements and apply
the bootstrap algorithm to estimate the aircraft position.

7.5 In financial systems, a stochastic volatility model is used in the analysis of 
financial time series such as daily fluctuations in the stock market prices, 
exchange rates and option pricing. The volatility is usually expressed in terms 
of a time-varying variance with the model given by:

$$y(t) = \sigma(t) \times \epsilon(t) \qquad \epsilon \sim \mathcal{N}(0,1)$$

$$\ln \sigma^2(t) = \alpha + \beta \ln \sigma^2(t-1) + \ln \epsilon^2(t)$$

or equivalently

$$y(t) = e^{\alpha(t)/2} \times \sigma(t) \times \epsilon(t) \qquad \epsilon \sim \mathcal{N}(0,1)$$

$$\ln \sigma^2(t) = \beta(t) \ln \sigma^2(t-1) + v(t) \qquad v \sim \mathcal{N}(0, \tau^2)$$

where $\sigma(t)$ corresponds to the time-varying volatility (amplitude) and the second
relation represents the change in volatility. The parameters $\alpha$ and $\beta$ are
regression coefficients, and the remaining parameters are the $\ln \tau^2(t)$ variance
term.

(a) Suppose we would like to estimate the unknown parameters augmenting
the original state ($\ln \sigma^2(t)$) with the unknowns, $\alpha$, $\beta$, $\ln \tau^2$. Assume the
parameters can be represented by a random walk driven by zero-mean,
Gaussian noise processes. What is the overall model for this financial
system?

(b) Construct a bootstrap PF for this problem.

(c) Simulate the data for $N = 1000$ samples and estimate the volatility and
parameters. The simulation parameters are: $\alpha = 1.8$, $\beta = 0.95$, $\tau^2 = 0.1$,
and $\epsilon \sim \mathcal{N}(0,1)$.

7.6 Develop a suite of particle filters for the RC-circuit problem of Ex. 5.1 where
the output voltage was given by:

$$e(t) = \left(1 - \frac{\Delta T}{RC}\right) e(t-1) + \frac{\Delta T}{C} I_{in}(t-1)$$

where $e_o$ is the initial voltage and $R$ is the resistance with $C$ the capacitance.
The measurement equation for a voltmeter of gain $K_e$ is simply

$$e_{out}(t) = K_e e(t)$$

Recall that for this circuit the parameters are: $R = 3.3$ kΩ and $C = 1000$ μF,
$\Delta T = 100$ ms, $e_o = 2.5$ V, $K_e = 2.0$, and the voltmeter is precise to within
±4 V. Then transforming the physical circuit model into state-space form by
defining $x = e$, $y = e_{out}$, and $u = I_{in}$, we obtain

$$x(t) = 0.97x(t-1) + 100u(t-1) + w(t-1)$$
$$y(t) = 2x(t) + v(t)$$

The process noise covariance is used to model the circuit parameter uncertainty
with $R_{ww} = 0.0001$, since we assume standard deviations, $\Delta R$, $\Delta C$ of 1%.
Also, $R_{vv} = 4$, since two standard deviations are $\Delta V = 2\left(\tfrac{1}{2}\,4\ \text{V}\right)$. We also
assume initially that the state is $x(0) \sim \mathcal{N}(2.5, 10^{-12})$, and that the input current
is a step function of $u(t) = 300$ μA.

With this in mind, we know that the optimal processor to estimate the state
is the linear BP (Kalman filter).

(a) After performing the simulation using the parameters above, construct a
bootstrap PF and compare its performance to the optimal. How does it
compare? Whiteness? Zero-mean? State estimation error?

(b) Let us assume that the circuit is malfunctioning and we do not precisely
know the current values of $RC$. Construct a parameter estimator
for $A = 1/RC$ using the EKF or UKF and compare its performance to the
bootstrap and linearized PF.

(c) Try “roughening” the bootstrap PF. Does its performance improve?
Compare results.

7.7 Consider the storage of plutonium nitrate in a tank (see [5] for details). We
would like to dynamically estimate the amount (mass) of Pu present at any
given time. Losses occur due to radiolysis and evaporation. The underlying
state-space model for this system is given by:

Summarizing the process and measurement models in state-space form, we
have

where $u$ is a step function of amplitude $-K_H$. The corresponding measurement
model based on pressure measurements is

$$\begin{bmatrix} \Delta P_A \\ \Delta P_B \end{bmatrix} = \begin{bmatrix} g/b & -(a/b)g \\ 0 & gH \end{bmatrix} \begin{bmatrix} m(t) \\ \rho(t) \end{bmatrix} + \begin{bmatrix} v_1(t) \\ v_2(t) \end{bmatrix}$$

Discretizing the dynamics and incorporating the model parameters, we obtain
the Gauss-Markov model with sampling interval of 0.1 day as

$$x(t) = \begin{bmatrix} 0.999 & 0 \\ 0 & 1 \end{bmatrix} x(t-1) + \begin{bmatrix} 1 \\ 0 \end{bmatrix} u(t-1) + w(t-1)$$

$$y(t) = \begin{bmatrix} 29.8 & -0.623 \\ 0 & 24.9 \end{bmatrix} x(t) + v(t)$$

$R_{ww} = \text{diag}[10\ 10]$, $R_{vv} = \text{diag}[5.06 \times 10^4\ \ 1.4 \times 10^5]$ with initial conditions
$\hat{x}(0|0) = [988\ 1455]'$ and $P(0|0) = \text{diag}[0.01\ 0.01]$.

(a) Develop the optimal BP for this problem and compare its performance to
the bootstrap PF.

(b) How well does the bootstrap PF perform? How about the linearized PF?

7.8 We are asked to investigate the possibility of creating a synthetic aperture using 
a single hydrophone to be towed by an AUV in search of targets. We know 
that a single hydrophone offers no improvement in SNR in the sense of array 
gain, but also wonder about its capability to localize, especially more than one 
target. 

(a) Using the synthetic aperture model developed in the case study of Sec. 8.5,
develop the bootstrap PF and UKF processors for this problem. Assume
we would like to track two targets.

(b) Perform the same simulation outlined in the case study for two targets
at −45°, −10°.

(c) Apply the bootstrap algorithm with and without “roughening” along with
the UKF. Discuss processor performances and compare. Zero-mean?
Whiteness?

(d) Implement the “optimal” PF processor using the EKF or UKF linearization.
How does its performance compare?

7.9 We are asked to investigate the possibility of finding the range of a target
using a hydrophone sensor array towed by an AUV assuming a near-field
target characterized by its spherical wavefront instead of the far-field target of
the previous problem using a plane wave model. The near-field processor can
be captured by a wavefront curvature scheme (see [5] for more details) with
process and measurement models (assuming that the parameters as well as
the measurements are contaminated by additive white Gaussian noise). The
following set of dynamic relations can be written succinctly as the Gauss-Markov
wavefront curvature model as

$$\Theta(t_k) = \Theta(t_{k-1}) + w(t_k) \quad \text{for } \Theta(t_k) = [\alpha\ f_o\ \theta_o\ r_o]'$$

$$p_\ell(t_k) = \theta_1(t_k)\, e^{j2\pi\theta_2(t_k)(t_k - \tau_\ell(\Theta; t_k))} + w_\ell(t_k); \quad \ell = 1, \ldots, L$$

where $\alpha$, $f_o$, $\theta_o$ and $r_o$ are the respective amplitude, target frequency, bearing
and range. The time delay at the $\ell$th sensor and time $t_k$ is given in terms of the
unknown parameters of

$$\tau_\ell(\Theta; t_k) := \frac{1}{c}\left(\theta_4(t_k) - \sqrt{\theta_4^2(t_k) + d_\ell^2(t) - 2 d_\ell(t)\,\theta_4(t_k)\sin\theta_3(t_k)}\right)$$

for $d_\ell(t)$ the distance between the $\ell$th sensor and reference range $r_o$ given by
$d_\ell(t) = x_\ell + v \times t_k$ for $x_\ell$ the position of sensor $\ell$.

(a) Using this model develop the bootstrap PF and UKF processors for this
problem. Assume we would like to estimate the target bearing, frequency
and range ($\alpha = 1$).

(b) Perform a simulation with initial parameters $r_o = 3$ km, $f_o = 51.1$ Hz and
$\theta_o = 27°$ and true parameters at $r = 2$ km, $f = 51$ Hz and $\theta = 25°$, $L = 4$.

(c) Apply the bootstrap algorithm with and without “roughening” along with
the UKF. Discuss processor performances and compare. Zero-mean?
Whiteness?

(d) Implement the “optimal” PF processor using the EKF or UKF linearization.
How does its performance compare?

7.10 Consider the bearings-only tracking problem of Ex. 5.4 given by the state-space
model. The entire system can be represented as an approximate Gauss-Markov
model with the noise sources representing uncertainties in the states
and measurements. The equations of motion are given by

$$x(t) = \begin{bmatrix} 1 & 0 & \Delta T & 0 \\ 0 & 1 & 0 & \Delta T \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} x(t-1) + \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} -\Delta v_{ox}(t-1) \\ -\Delta v_{oy}(t-1) \end{bmatrix} + w(t-1)$$

with the nonlinear sensor model given by

$$y(t) = \arctan\frac{x_1(t)}{x_2(t)} + v(t)$$

for $w \sim \mathcal{N}(0, R_{ww})$ and $v \sim \mathcal{N}(0, R_{vv})$.

(a) Using this model develop the bootstrap PF and UKF processors for this
problem. Assume we would like to estimate the target bearing, frequency
and range ($\alpha = 1$).

(b) Perform a simulation with the following parameters: an impulse-incremental
step change ($\Delta v_{ox} = -24$ knots and $\Delta v_{oy} = +10$ knots) was
initiated at 0.5 h, resulting in a change of observer position and velocity
depicted in the figure. The simulated bearing measurements are shown in
Fig. 5.6d. The initial conditions for the run were $x'(0) := [0\ \ 15\ \text{nm}\ \ 20\ \text{k}\ \ {-10}\ \text{k}]$
and $R_{ww} = \text{diag}[10^{-6}]$ with the measurement noise covariance given
by $R_{vv} = 3.05 \times 10^{-4}$ rad$^2$ for $\Delta T = 0.33$ h.

(c) Apply the bootstrap algorithm along with the UKF. Discuss processor
performances and compare. Zero-mean? Whiteness?

(d) Implement the “optimal” PF processor using the EKF or UKF linearization.
How does its performance compare?



JOINT BAYESIAN 
STATE/PARAMETRIC 
PROCESSORS 


8.1 INTRODUCTION 

In this chapter we develop the Bayesian approach to the parameter estimation/system
identification problem [1-4] which is based on the decomposition of the joint posterior
distributions that incorporates both dynamic state and parameter variables. From
this formulation the following problems evolve: (1) joint state/parameter estimation;
(2) state estimation; and (3) parameter (fixed and/or dynamic) estimation. The state
estimation problem is thoroughly discussed in the previous chapters. However, the
most common problem found in the current literature is the parameter estimation
problem which can be solved “off-line” using batch approaches (maximum entropy,
maximum likelihood, minimum variance, least squares, etc.) or “on-line” using the
expectation-maximization (EM) technique (see Chapter 2), the stochastic Monte
Carlo approach and for that matter almost any (deterministic) optimization technique
[5, 6]. These on-line approaches follow the classical (EKF), modern (UKF)
and the sequential Monte Carlo or particle filter (PF). However, it still appears that
there is no universally accepted approach to solving this problem especially for fixed
parameters [7-9]. From the pragmatic perspective, the most useful problem is the
joint state/parameter estimation problem, since it evolves quite naturally from the
fact that a model is developed to solve the basic state estimation problem and it is
found that its inherent parameters are either poorly specified, just bounded or even
unknown, inhibiting the development of the processor. We call this problem the “joint”
state/parameter estimation, since both states and parameters are estimated simultaneously
on-line and the resulting processor is termed parametrically adaptive [18]. This
terminology evolves because the inherent model parameters are adjusted sequentially
as the measurement data becomes available.


Bayesian Signal Processing. By James V. Candy 
Copyright © 2009 John Wiley & Sons, Inc. 






In this chapter, we concentrate primarily on the joint Bayesian state/parameter
estimation problem and refer the interested reader to the wealth of literature available
on this subject [7-19]. First, we precisely define the three basic problems
from the Bayesian perspective and then investigate the classical, modern and particle
approaches to its solution. We incorporate the nonlinear re-entry problem of
Jazwinski [20] used throughout as an example of parametrically adaptive design and
then discuss a case study to demonstrate the approach.


8.2 BAYESIAN APPROACH TO JOINT STATE/PARAMETER 
ESTIMATION 

To be more precise, we start by defining the joint state/parametric estimation problem. 
We begin by formulating the Bayesian recursions in terms of the posterior distribution 
using Bayes’ rule for decomposition, that is, 

$$\Pr(x(t),\theta(t)|Y_t) = \Pr(x(t)|\theta(t),Y_t) \times \Pr(\theta(t)|Y_t) = \Pr(\theta(t)|x(t),Y_t) \times \Pr(x(t)|Y_t) \qquad (8.1)$$

From this relation, we begin to “see” just how the variety of state and parameter
estimation related problems evolve, that is,

• Optimize the joint state/parameter posterior:

$\Pr(x(t),\theta(t)|Y_t)$ [state/parameter estimation]

• Optimize the state posterior:

$\Pr(x(t)|Y_t)$ [state estimation]

• Optimize the parametric posterior:

$\Pr(\theta(t)|Y_t)$ [parameter estimation]

Now if we proceed with the usual factorizations, we obtain the Bayesian decomposition
for the state estimation problem as

$$\Pr(x(t)|Y_t) = \frac{\Pr(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1})}{\Pr(y(t)|Y_{t-1})}$$

$$\Pr(x(t)|Y_{t-1}) = \int \Pr(x(t)|x(t-1)) \times \Pr(x(t-1)|Y_{t-1})\,dx(t-1) \qquad (8.2)$$

Equivalently for the parameter estimation problem, we have

$$\Pr(\theta(t)|Y_t) = \frac{\Pr(y(t)|\theta(t)) \times \Pr(\theta(t)|Y_{t-1})}{\Pr(y(t)|Y_{t-1})}$$

$$\Pr(\theta(t)|Y_{t-1}) = \int \Pr(\theta(t)|\theta(t-1)) \times \Pr(\theta(t-1)|Y_{t-1})\,d\theta(t-1) \qquad (8.3)$$
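
As a hedged illustration of the parameter recursion of Eq. 8.3, the following Python sketch evaluates the prediction integral and the Bayes update on a discrete grid of parameter values. Everything specific here is an assumption made for the example and is not taken from the text: the scalar measurement model $y(t) = \theta(t) + v(t)$, the Gaussian random-walk transition, the uniform prior, and the names `grid_parameter_filter`, `r_ww` and `r_vv`.

```python
import numpy as np

def grid_parameter_filter(y, theta_grid, r_vv=1.0, r_ww=0.01):
    """Grid approximation of Pr(theta(t) | Y_t) for y(t) = theta(t) + v(t).

    y:          sequence of scalar measurements.
    theta_grid: 1-D array of candidate parameter values.
    r_vv, r_ww: assumed measurement and random-walk noise variances.
    Returns the posterior probabilities on the grid after the last measurement.
    """
    theta_grid = np.asarray(theta_grid, dtype=float)
    # Transition kernel K[i, j] ~ Pr(theta_i(t) | theta_j(t-1)), random walk.
    diff = theta_grid[:, None] - theta_grid[None, :]
    K = np.exp(-0.5 * diff ** 2 / r_ww)
    K /= K.sum(axis=0, keepdims=True)          # each column sums to one

    post = np.full(theta_grid.size, 1.0 / theta_grid.size)  # uniform prior
    for yt in y:
        pred = K @ post                         # prediction integral (as a sum)
        like = np.exp(-0.5 * (yt - theta_grid) ** 2 / r_vv)  # Pr(y(t)|theta(t))
        post = like * pred
        post /= post.sum()                      # division by the evidence term
    return post
```

The normalization in the last step plays the role of the denominator $\Pr(y(t)|Y_{t-1})$. A grid is practical only for one or two parameters, which is part of the motivation for the sampling-based processors.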

Now for the joint state/parameter estimation problem of Eq. 8.1, we can substitute
the equations above to obtain the posterior decomposition of interest, that is,

$$\Pr(x(t),\theta(t)|Y_t) = \frac{\Pr(x(t)|\theta(t),Y_t) \times [\Pr(y(t)|\theta(t)) \times \Pr(\theta(t)|Y_{t-1})]}{\Pr(y(t)|Y_{t-1})} \qquad (8.4)$$

or

$$\Pr(x(t),\theta(t)|Y_t) = \Pr(x(t)|\theta(t),Y_t) \times \Pr(y(t)|\theta(t)) \times \frac{\int \Pr(\theta(t)|\theta(t-1)) \times \Pr(\theta(t-1)|Y_{t-1})\,d\theta(t-1)}{\Pr(y(t)|Y_{t-1})} \qquad (8.5)$$

This is the most common decomposition found in the literature [7-19] and leads to
the maximization of the first term with respect to $x$ and the second with respect to $\theta$
[22, 23].

Alternatively, using the state/parameter estimation form which is rarely applied,
we have

$$\Pr(\theta(t),x(t)|Y_t) = \frac{\Pr(\theta(t)|x(t),Y_t) \times [\Pr(y(t)|x(t)) \times \Pr(x(t)|Y_{t-1})]}{\Pr(y(t)|Y_{t-1})} \qquad (8.6)$$

or

$$\Pr(\theta(t),x(t)|Y_t) = \Pr(\theta(t)|x(t),Y_t) \times \Pr(y(t)|x(t)) \times \frac{\int \Pr(x(t)|x(t-1)) \times \Pr(x(t-1)|Y_{t-1})\,dx(t-1)}{\Pr(y(t)|Y_{t-1})} \qquad (8.7)$$

Here the first term is maximized with respect to $\theta$ and the second with respect to $x$
compared to the previous decomposition.

So we see that Bayes’ rule can be applied in a number of ways to develop the sequential
Bayesian processors for the state, parameter and joint state/parameter estimation
problems.

8.3 CLASSICAL/MODERN JOINT BAYESIAN STATE/PARAMETRIC PROCESSORS

In this section, we develop the “joint” state-space Bayesian sequential processor
or equivalently the parametrically adaptive Bayesian signal processor (ABSP) for
nonlinear state-space systems. The ABSP is a joint state/parametric processor, since it
estimates both the states as well as the unknown model parameters. It is parametrically
adaptive, since it adjusts or “adapts” the model parameters at each time step. The
simplified structure of the classical (EKF) parameter estimator is shown in Fig. 8.1.
We see the basic structure of the ABSP which consists of two distinct, yet coupled 
processors: a parameter estimator and a state estimator. The parameter estimator 
provides estimates that are corrected by the corresponding innovations during each 
recursion. These estimates are then provided to the state estimator in order to update 
the model parameters used in the estimator. After both state and parameter estimates 
are calculated, a new measurement is processed and the procedure continues. In 
general, this processor can be considered to be a form of identifier, since system 
identification is typically concerned with the estimation of a model and its associated 
parameters from noisy measurement data. Usually the model structure is pre-defined 
(as in our case) and then a parameter estimator is developed to “fit” parameters 
according to some error criterion. After completion or during this estimation, the 
quality of the estimates must be evaluated to decide if the processor performance 
is satisfactory or equivalently the model adequately represents the data. There are 
various types (criteria) of identifiers employing many different model (usually linear)
structures [2-4]. Here we are primarily concerned with joint estimation in which the
models and parameters are usually nonlinear. Thus, we will concentrate on developing 
parameter estimators capable of on-line operations with nonlinear dynamics. 


FIGURE 8.1 Nonlinear parametrically adaptive (ABSP): simplified processor structure
illustrating the coupling between parameter and state estimators through the innovation
and measurement sequences.

8.3.1 Classical Joint Bayesian Processor

From our previous discussion in Chapter 5, it is clear that the extended Bayesian
processor XBP (extended Kalman filter) can satisfy these constraints nicely, so we
begin our analysis of the XBP as a state/parametric estimator closely following the
approach of Ljung [21] for the linear problem discussed in Sec. 8.2. The general nonlinear
parameter estimator structure can be derived directly from the XBP algorithm
in Table 5.3.

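Recall the Bayesian solution to the classical problem was based on solving for
the posterior distribution (see Eq. 8.2) such that each of the required distributions
was represented in terms of the XBP estimates. In the joint state/parameter estimation
case these distributions map simply by defining an augmented state vector
$\mathcal{X}(t) := [x(t)\,|\,\theta(t)]'$ to give:

$$\Pr(y(t)|x(t)) \sim \mathcal{N}(c[x(t)], R_{vv}(t)) \;\Leftrightarrow\; \Pr(y(t)|x(t),\theta(t)) \sim \mathcal{N}(c[x(t),\theta(t)], R_{vv}(t))$$

$$\Pr(x(t)|Y_{t-1}) \sim \mathcal{N}(\hat{x}(t|t-1), P(t|t-1)) \;\Leftrightarrow\; \Pr(\mathcal{X}(t)|Y_{t-1}) \sim \mathcal{N}(\hat{\mathcal{X}}(t|t-1), \mathcal{P}(t|t-1))$$

$$\Pr(y(t)|Y_{t-1}) \sim \mathcal{N}(\hat{y}(t|t-1), R_{ee}(t)) \;\Leftrightarrow\; \Pr(y(t)|Y_{t-1}) \sim \mathcal{N}(\hat{y}_\theta(t|t-1), R_{e_\theta e_\theta}(t)) \qquad (8.8)$$

where

$$\hat{\mathcal{X}}(t|t-1) := [\hat{x}(t|t-1)\,|\,\hat{\theta}(t|t-1)]'$$

$$\hat{y}_\theta(t|t-1) := c[\hat{x}(t|t-1), \hat{\theta}(t|t-1)]$$

$$\mathcal{P}(t|t-1) := \left[\begin{array}{c|c} P_{xx}(t|t-1) & P_{x\theta}(t|t-1) \\ \hline P_{\theta x}(t|t-1) & P_{\theta\theta}(t|t-1) \end{array}\right]$$

$$R_{e_\theta e_\theta}(t) := \mathcal{C}[\hat{\mathcal{X}}(t|t-1)]\,\mathcal{P}(t|t-1)\,\mathcal{C}[\hat{\mathcal{X}}(t|t-1)]' + R_{vv}(t)$$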

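To develop the actual internal structure of the ABSP, we start with the XBP equations,
augment them with the unknown parameters and then investigate the resulting algorithm.
We first define the composite state-vector (as before) consisting of the original
states, $x(t)$, and the unknown “augmented” parameters represented by $\theta(t)$, that is,

$$\mathcal{X}(t) := \begin{bmatrix} x(t) \\ \theta(t) \end{bmatrix} \qquad (8.9)$$

where $t$ is the time index, $\mathcal{X} \in R^{(N_x+N_\theta) \times 1}$ and $x \in R^{N_x \times 1}$, $\theta \in R^{N_\theta \times 1}$. Substituting this
augmented state vector into the XBP relations of Table 5.3, the following matrix
partitions evolve as

$$\mathcal{P}(t|t-1) := \left[\begin{array}{c|c} P_{xx}(t|t-1) & P_{x\theta}(t|t-1) \\ \hline P_{\theta x}(t|t-1) & P_{\theta\theta}(t|t-1) \end{array}\right], \qquad P_{\theta x}(t|t-1) = P'_{x\theta}(t|t-1) \qquad (8.10)$$

where $\mathcal{P} \in R^{(N_x+N_\theta) \times (N_x+N_\theta)}$, $P_{xx} \in R^{N_x \times N_x}$, $P_{\theta\theta} \in R^{N_\theta \times N_\theta}$ and $P_{x\theta} \in R^{N_x \times N_\theta}$.

This also leads to the partitioning of the corresponding gain

$$\mathcal{K}(t) := \begin{bmatrix} K_x(t) \\ K_\theta(t) \end{bmatrix}$$

for $\mathcal{K} \in R^{(N_x+N_\theta) \times N_y}$, $K_x \in R^{N_x \times N_y}$ and $K_\theta \in R^{N_\theta \times N_y}$.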

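We must also partition the state and measurement predictions as equations, that is,¹

$$\hat{\mathcal{X}}(t|t-1) = \begin{bmatrix} \hat{x}(t|t-1) \\ \hat{\theta}(t|t-1) \end{bmatrix} = \begin{bmatrix} a[\hat{x}(t-1|t-1), \hat{\theta}(t-1|t-1)] + b[\hat{x}(t-1|t-1), \hat{\theta}(t-1|t-1), u(t-1)] \\ \hat{\theta}(t-1|t-1) \end{bmatrix}$$

where the corresponding predicted measurement equation becomes

$$\hat{y}(t|t-1) = c[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] \qquad (8.13)$$

Next we consider the predicted error covariance

$$\mathcal{P}(t|t-1) = \mathcal{A}[\hat{x}(t|t-1), \hat{\theta}(t|t-1)]\,\mathcal{P}(t-1|t-1)\,\mathcal{A}'[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] + R_{ww}(t-1) \qquad (8.14)$$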

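which can be written in partitioned form using Eq. 8.10 and the following Jacobian
process matrix²

$$\mathcal{A}[\hat{x}, \hat{\theta}] = \left[\begin{array}{c|c} A_x[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] & A_\theta[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] \\ \hline O & I_{N_\theta} \end{array}\right] \qquad (8.15)$$

where

$$A_x[\hat{x}, \hat{\theta}] := \frac{\partial a[x,\theta]}{\partial x} + \frac{\partial b[x,\theta]}{\partial x}, \qquad A_\theta[\hat{x}, \hat{\theta}] := \frac{\partial a[x,\theta]}{\partial \theta} + \frac{\partial b[x,\theta]}{\partial \theta} \qquad (8.16)$$

with $\mathcal{A} \in R^{(N_x+N_\theta) \times (N_x+N_\theta)}$, $A_x \in R^{N_x \times N_x}$, $A_\theta \in R^{N_x \times N_\theta}$.

1 Note that we have “implicitly” assumed that the parameters can be considered piecewise constant,
$\theta(t) = \theta(t-1)$, or follow a random walk if we add process noise to this representation in the
Gauss-Markov sense. However, if we do have process dynamics with linear or nonlinear models characterizing
the parameters, then they will replace the random walk and the associated Jacobian etc. will change from
this representation.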

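Using these partitions, we can develop the ABSP directly from the XBP processor
in this joint state and “parameter estimation” form. Substituting Eq. 8.15 into the
XBP Prediction Covariance relation of Table 5.3 and using the partition defined in
Eq. 8.10, we obtain (suppressing the $\hat{x}$, $\hat{\theta}$, time index $t$ notation for simplicity)

$$\mathcal{P}(t|t-1) = \left[\begin{array}{c|c} A_x & A_\theta \\ \hline 0 & I_{N_\theta} \end{array}\right] \left[\begin{array}{c|c} P_{xx} & P_{x\theta} \\ \hline P_{\theta x} & P_{\theta\theta} \end{array}\right] \left[\begin{array}{c|c} A_x & A_\theta \\ \hline 0 & I_{N_\theta} \end{array}\right]' + \left[\begin{array}{c|c} R_{w_x w_x} & 0 \\ \hline 0 & R_{w_\theta w_\theta} \end{array}\right]$$

Expanding these equations, we obtain the following set of predicted covariance
relations

$$\mathcal{P}(t|t-1) = \left[\begin{array}{c|c} A_x P_{xx} A_x' + A_\theta P_{\theta x} A_x' + A_x P_{x\theta} A_\theta' + A_\theta P_{\theta\theta} A_\theta' + R_{w_x w_x} & A_x P_{x\theta} + A_\theta P_{\theta\theta} \\ \hline P_{\theta x} A_x' + P_{\theta\theta} A_\theta' & P_{\theta\theta} + R_{w_\theta w_\theta} \end{array}\right] \qquad (8.18)$$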
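
The block expansion of Eq. 8.18 can be checked numerically against the unpartitioned product. The Python sketch below is only a consistency check under the random-walk assumption (identity lower-right block in the Jacobian). The function names are hypothetical, and the test matrices in the usage note are random and carry no physical meaning.

```python
import numpy as np

def predicted_covariance_blocks(A_x, A_t, P_xx, P_xt, P_tt, R_wx, R_wt):
    """Assemble the predicted covariance from the four blocks of Eq. 8.18.

    The suffix `t` stands for theta; P_tx is taken as the transpose of P_xt.
    """
    P_tx = P_xt.T
    top_left = (A_x @ P_xx @ A_x.T + A_t @ P_tx @ A_x.T
                + A_x @ P_xt @ A_t.T + A_t @ P_tt @ A_t.T + R_wx)
    top_right = A_x @ P_xt + A_t @ P_tt
    bottom_left = P_tx @ A_x.T + P_tt @ A_t.T
    bottom_right = P_tt + R_wt
    return np.block([[top_left, top_right], [bottom_left, bottom_right]])

def predicted_covariance_full(A_x, A_t, P, R_wx, R_wt):
    """Same quantity from the augmented Jacobian: A P A' + R_ww."""
    n_x, n_t = A_t.shape
    A = np.block([[A_x, A_t], [np.zeros((n_t, n_x)), np.eye(n_t)]])
    R = np.block([[R_wx, np.zeros((n_x, n_t))],
                  [np.zeros((n_t, n_x)), R_wt]])
    return A @ P @ A.T + R
```

For any symmetric `P` split as `P[:n_x, :n_x]`, `P[:n_x, n_x:]` and `P[n_x:, n_x:]`, the two functions should agree to rounding error, which is a quick way to catch a transposed or misplaced block when coding the partitioned form by hand.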


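The innovations covariance follows from the XBP as

$$R_{ee}(t) = \mathcal{C}[\hat{x}, \hat{\theta}]\,\mathcal{P}(t|t-1)\,\mathcal{C}'[\hat{x}, \hat{\theta}] + R_{vv}(t) \qquad (8.19)$$

Now we must use the partitions of $\mathcal{P}$ above along with the measurement Jacobian

$$\mathcal{C}[\hat{x}, \hat{\theta}] = \left[\, C_x[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] \;\middle|\; C_\theta[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] \,\right] \qquad (8.20)$$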


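with $\mathcal{C} \in R^{N_y \times (N_x+N_\theta)}$, $C_x \in R^{N_y \times N_x}$, $C_\theta \in R^{N_y \times N_\theta}$.

2 Here is where the underlying random walk model enters the structure. The lower block rows of $\mathcal{A}$ could
be replaced by $[A_{\theta x}[\hat{x}, \hat{\theta}] \,|\, A_{\theta\theta}[\hat{x}, \hat{\theta}]]$ which enables a linear or nonlinear dynamic model to be embedded
directly.

The corresponding innovations covariance follows from Eqs. 8.19 and 8.20 as

$$R_{ee}(t) = \left[\, C_x \,|\, C_\theta \,\right] \left[\begin{array}{c|c} P_{xx} & P_{x\theta} \\ \hline P_{\theta x} & P_{\theta\theta} \end{array}\right] \begin{bmatrix} C_x' \\ C_\theta' \end{bmatrix} + R_{vv}(t)$$

or expanding

$$R_{ee}(t) = C_x P_{xx} C_x' + C_\theta P_{\theta x} C_x' + C_x P_{x\theta} C_\theta' + C_\theta P_{\theta\theta} C_\theta' + R_{vv}$$

$R_{ee} \in R^{N_y \times N_y}$. The gain of the XBP in Table 5.3 is calculated from these partitioned
expressions as

$$\mathcal{K}(t) = \begin{bmatrix} K_x(t) \\ K_\theta(t) \end{bmatrix} = \begin{bmatrix} (P_{xx} C_x' + P_{x\theta} C_\theta')\,R_{ee}^{-1}(t) \\ (P_{\theta x} C_x' + P_{\theta\theta} C_\theta')\,R_{ee}^{-1}(t) \end{bmatrix}$$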

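where $K \in R^{(N_x+N_\theta) \times N_y}$, $K_x \in R^{N_x \times N_y}$, $K_\theta \in R^{N_\theta \times N_y}$. With the gain determined, the
corrected state/parameter estimates follow easily, since the innovations remain
unchanged, that is,

$$
e(t) = y(t) - \hat{y}(t|t-1) = y(t) - c[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] \tag{8.26}
$$

and therefore partitioning the corrected state equations, we have

$$
\hat{X}(t|t) = \begin{bmatrix} \hat{x}(t|t) \\ \hat{\theta}(t|t) \end{bmatrix}
= \begin{bmatrix} \hat{x}(t|t-1) \\ \hat{\theta}(t|t-1) \end{bmatrix}
+ \begin{bmatrix} K_x(t)\,e(t) \\ K_\theta(t)\,e(t) \end{bmatrix} \tag{8.27}
$$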
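Finally, the corrected covariance expression is easily derived from the following
partitions

$$
\mathcal{P}(t|t) = \begin{bmatrix} P_{xx} & P_{x\theta} \\ P_{\theta x} & P_{\theta\theta} \end{bmatrix}
- \begin{bmatrix} K_x(t) \\ K_\theta(t) \end{bmatrix}
\begin{bmatrix} C_x & C_\theta \end{bmatrix}
\begin{bmatrix} P_{xx} & P_{x\theta} \\ P_{\theta x} & P_{\theta\theta} \end{bmatrix} \tag{8.28}
$$

Performing the indicated multiplications leads to the final expression

$$
\mathcal{P}(t|t) =
\begin{bmatrix}
P_{xx} - K_x C_x P_{xx} - K_x C_\theta P_{\theta x} & P_{x\theta} - K_x C_x P_{x\theta} - K_x C_\theta P_{\theta\theta} \\
P_{\theta x} - K_\theta C_x P_{xx} - K_\theta C_\theta P_{\theta x} & P_{\theta\theta} - K_\theta C_x P_{x\theta} - K_\theta C_\theta P_{\theta\theta}
\end{bmatrix} \tag{8.29}
$$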


We summarize the parametrically adaptive model-based processor in predictor-corrector
form in Table 8.1. Note that this algorithm is not implemented in this
fashion; it is implemented in the numerically stable "upper triangular-diagonal" or
UD-factorized form as in SSPACK_PC [18]. Here we are just interested in the overall
internal structure of the algorithm and the decomposition that evolves. This completes
the development of the generic ABSP.

It is important to realize that, apart from its numerical implementation, the ABSP is
simply the XBP with an augmented state vector, thereby implicitly creating the partitions
developed above. Implementing these decomposed equations directly is not
necessary: just augment the state with the unknown parameters and the ABSP evolves
naturally from the standard XBP algorithm of Table 5.3. The ABSP of Table 8.1
indicates where to locate the partitions. For example, suppose we would like to extract the
submatrix $P_{\theta\theta}$, but the XBP only provides the overall $(N_x+N_\theta) \times (N_x+N_\theta)$ error covariance
matrix $P$. Locating the lower $N_\theta \times N_\theta$ submatrix of $P$ enables us to extract
$P_{\theta\theta}$ directly.
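
The equivalence between the augmented and the partitioned forms can be checked numerically. The following is a minimal sketch, assuming NumPy; the function names (`predict_cov_full`, `predict_cov_blocks`, `extract_param_cov`) are illustrative and are not part of SSPACK_PC. It forms the predicted covariance once as $A P A' + R_{ww}$ with the block Jacobian of Eq. 8.15 and once block by block as in Eq. 8.18, and then extracts the lower $N_\theta \times N_\theta$ parameter block.

```python
import numpy as np

def predict_cov_full(A_x, A_th, P, R_wx, R_wth):
    """Augmented form: P(t|t-1) = A P A' + R_ww with the block Jacobian of Eq. 8.15."""
    n_x, n_th = A_th.shape
    A = np.block([[A_x, A_th],
                  [np.zeros((n_th, n_x)), np.eye(n_th)]])
    R = np.block([[R_wx, np.zeros((n_x, n_th))],
                  [np.zeros((n_th, n_x)), R_wth]])
    return A @ P @ A.T + R

def predict_cov_blocks(A_x, A_th, P, R_wx, R_wth):
    """Partitioned form: the four blocks of Eq. 8.18 written out explicitly."""
    n_x = A_x.shape[0]
    P_xx, P_xt = P[:n_x, :n_x], P[:n_x, n_x:]
    P_tx, P_tt = P[n_x:, :n_x], P[n_x:, n_x:]
    top_left = (A_x @ P_xx @ A_x.T + A_th @ P_tx @ A_x.T
                + A_x @ P_xt @ A_th.T + A_th @ P_tt @ A_th.T + R_wx)
    top_right = A_x @ P_xt + A_th @ P_tt
    bottom_left = P_tx @ A_x.T + P_tt @ A_th.T
    bottom_right = P_tt + R_wth
    return np.block([[top_left, top_right],
                     [bottom_left, bottom_right]])

def extract_param_cov(P, n_x):
    """Lower N_theta x N_theta submatrix of the augmented covariance."""
    return P[n_x:, n_x:]
```

Because both routines return the same matrix, a standard XBP run on the augmented state is sufficient in practice; the block form is useful mainly for inspecting how the state and parameter uncertainties couple.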

Next let us reconsider the nonlinear system example given in Chapter 5 and 
investigate the performance of the parametrically adaptive Bayesian processor. 


Example 8.1 

Recall the discrete nonlinear trajectory estimation problem [20] of Chapter 5
given by

$$
x(t) = (1 - 0.05\Delta T)\,x(t-1) + 0.04\,x^2(t-1) + w(t-1)
$$

with corresponding measurement model

$$
y(t) = x^2(t) + x^3(t) + v(t)
$$

where $v(t) \sim \mathcal{N}(0, 0.09)$, $x(0) = 2.0$, $P(0) = 0.01$, $\Delta T = 0.01$ sec and $R_{ww} = 0$.



TABLE 8.1 ABSP Algorithm 
Initial conditions: $\hat{x}(0|0)$, $P(0|0)$; Jacobian: $A[x,\theta]$
Here we generalize the problem to the case where the coefficients of the process
are unknown, leading to the ABSP solution. Therefore, the process equations for this
problem become

$$
x(t) = (1 - \theta_1 \Delta T)\,x(t-1) + \theta_2\,x^2(t-1) + w(t-1)
$$

with the identical measurement and covariances as before. The true parameters
are: $\Theta_{true} = [0.05 \;\; 0.04]'$. The ABSP can be applied to this problem by defining the
parameter vector as

$$
\Theta(t) = \Theta(t-1) \quad \text{(constant)}
$$

and augmenting it to form the new state vector $X = [x' \;\; \theta']'$. Therefore the process
model becomes

$$
X(t) = \begin{bmatrix} (1 - \theta_1(t-1)\Delta T)\,x(t-1) + \theta_2(t-1)\Delta T\,x^2(t-1) \\ \theta_1(t-1) \\ \theta_2(t-1) \end{bmatrix} + w(t-1)
$$

$$
y(t) = x^2(t) + x^3(t) + v(t)
$$


To implement the ABSP the required Jacobians are

$$
A[X(t-1)] = \begin{bmatrix}
[1 - \theta_1(t-1)\Delta T + 2\Delta T\,\theta_2(t-1)\,x(t-1)] & -\Delta T\,x(t-1) & \Delta T\,x^2(t-1) \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
$$

$$
C[X(t-1)] = [\,2x(t-1) + 3x^2(t-1) \;\;\; 0 \;\;\; 0\,]
$$

Using SSPACK_PC [18] the ABSP is applied to solve this problem for 1500 samples
with $\Delta T = 0.01$ sec. Initially, we used the starting parameters:

$$
P(0|0) = \mathrm{diag}[100 \;\; 100 \;\; 100] \quad \text{and} \quad \hat{X}(0|0) = [2 \;\; 0.055 \;\; 0.044]'
$$

The results of the ABSP run are shown in Fig. 8.2. We see the estimated state and
parameter estimates in a and b. After a short transient (25 samples), the state estimate
begins tracking the true state as evidenced by the innovations sequence in Fig. 8.2c.
The parameter estimates slowly converge to their true values as evidenced by the
plots. The final estimates are

$\hat{\theta}_1 = 0.0470$
$\hat{\theta}_2 = 0.0395$

Part of the slow convergence stems from the lack of sensitivity
of the measurement, or equivalently the innovations, to parameter variations
in this problem. This is implied by the zero-mean/whiteness tests shown in c.
The innovations are statistically white (3.6% out) and zero-mean (0.008 < 0.061).
The filtered measurement is also shown in c. This completes the ABSP
example. △△△

FIGURE 8.2 XBP (EKF) simulation. (a) Estimated state and parameter no. 1. (b) Estimated
parameter no. 2 and innovation. (c) Predicted measurement and zero-mean/whiteness
test (0.008 < 0.061 and 3.6% out).
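
The example can be reproduced in outline with a plain augmented-state extended processor. The sketch below assumes NumPy and is not the UD-factorized SSPACK_PC implementation; all function names are illustrative. It uses the augmented process model with the $\Delta T$ factor on the quadratic term (as in the vector model and Jacobian above), evaluates the measurement Jacobian at the predicted state, and symmetrizes the covariance after each update as a simple numerical safeguard. No particular parameter estimates are claimed for it.

```python
import numpy as np

DT = 0.01          # sampling interval (sec), from the example
R_VV = 0.09        # measurement noise variance, from the example

def a_aug(X):
    """Augmented process model: X = [x, theta1, theta2]."""
    x, th1, th2 = X
    return np.array([(1.0 - th1 * DT) * x + th2 * DT * x**2, th1, th2])

def jac_a(X):
    """Process Jacobian A[X] of the augmented model."""
    x, th1, th2 = X
    return np.array([[1.0 - th1 * DT + 2.0 * DT * th2 * x, -DT * x, DT * x**2],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])

def c_meas(X):
    """Measurement model y = x^2 + x^3."""
    return X[0]**2 + X[0]**3

def jac_c(X):
    """Measurement Jacobian C[X] = [2x + 3x^2, 0, 0]."""
    return np.array([[2.0 * X[0] + 3.0 * X[0]**2, 0.0, 0.0]])

def absp_step(X_hat, P, y, R_ww):
    """One predictor-corrector cycle of the augmented extended processor."""
    A = jac_a(X_hat)
    X_pred = a_aug(X_hat)
    P_pred = A @ P @ A.T + R_ww
    C = jac_c(X_pred)
    e = y - c_meas(X_pred)                        # innovation
    R_ee = (C @ P_pred @ C.T).item() + R_VV       # innovations covariance
    K = (P_pred @ C.T).ravel() / R_ee             # gain [K_x; K_theta]
    X_new = X_pred + K * e
    P_new = P_pred - np.outer(K, C @ P_pred)      # (I - K C) P_pred
    return X_new, 0.5 * (P_new + P_new.T), e

def simulate(n_samples=1500, seed=0):
    """Noise-free process (R_ww = 0) with noisy measurements."""
    rng = np.random.default_rng(seed)
    X = np.array([2.0, 0.05, 0.04])
    xs, ys = np.empty(n_samples), np.empty(n_samples)
    for t in range(n_samples):
        X = a_aug(X)
        xs[t] = X[0]
        ys[t] = c_meas(X) + rng.normal(0.0, np.sqrt(R_VV))
    return xs, ys

def run_absp(ys, X0=(2.0, 0.055, 0.044), P0=(100.0, 100.0, 100.0)):
    """Run the processor over a measurement record; returns estimates and final P."""
    X_hat = np.array(X0, dtype=float)
    P = np.diag(np.array(P0, dtype=float))
    R_ww = np.zeros((3, 3))
    est = np.empty((len(ys), 3))
    for t, y in enumerate(ys):
        X_hat, P, _ = absp_step(X_hat, P, y, R_ww)
        est[t] = X_hat
    return est, P
```

A typical call is `xs, ys = simulate()` followed by `est, P = run_absp(ys)`; the columns of `est` hold the state and the two parameter estimates, and `P[1:, 1:]` is the parameter block $P_{\theta\theta}$.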

As pointed out by Ljung [1, 2, 21], it is important to realize that the XBP is suboptimal
as a parameter estimator compared to the recursive prediction error (RPE)
method based on the Gauss-Newton (stochastic) descent algorithm. Comparing the
processors in this context, we see that if a gradient term $[\nabla_\theta K(\Theta)]\,e(t)$ is incorporated
into the XBP (add this term to $A_\theta$), its convergence will be improved, approaching the
performance of the RPE algorithm (see Ljung [21] for details). We also note in passing
that the nonlinear BSP in the form developed in Chapter 5 as well as the parametrically
adaptive ABSP are heavily employed in neural network applications. For more details
of this important application see Haykin [17]. Next we consider the development of
the "modern" approach to Bayesian processor design using the unscented Bayesian
processor of Chapter 6.
8.3.2 Modern Joint Bayesian Processor

The modern unscented processor offers a similar representation as the extended
processor detailed in the previous subsection. Here we briefly outline its structure for
solution to the joint problem and apply it to the trajectory estimation problem for
comparison. We again start with the augmented state vector defined initially by
sigma-points, that is,

$$
\mathcal{X}(t) := \begin{bmatrix} x(t) \\ \theta(t) \end{bmatrix} \tag{8.30}
$$

with $\mathcal{X} \in R^{(N_x+N_\theta) \times 1}$ and $x \in R^{N_x \times 1}$, $\theta \in R^{N_\theta \times 1}$. Substituting this augmented state vector
into the SPBP relations of Table 6.1 yields the desired processor. We again draw the
equivalences (as before):


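$$
\begin{aligned}
\Pr(y(t)|x(t),\theta(t)) &\sim \mathcal{N}(c[x(t),\theta(t)], R_{vv}(t)) \\
\Pr(X(t)|Y_{t-1}) &\sim \mathcal{N}(\hat{X}(t|t-1), \mathcal{P}(t|t-1)) \\
\Pr(y(t)|Y_{t-1}) &\sim \mathcal{N}(\hat{y}_\theta(t|t-1), R_{e_\theta e_\theta}(t))
\end{aligned} \tag{8.31}
$$

where

$$
\begin{aligned}
\hat{X}(t|t-1) &:= [\hat{x}'(t|t-1) \;|\; \hat{\theta}'(t|t-1)]' \\
\hat{y}_\theta(t|t-1) &:= c[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] \\
\mathcal{P}(t|t-1) &:= \begin{bmatrix} P_{xx}(t|t-1) & P_{x\theta}(t|t-1) \\ P_{\theta x}(t|t-1) & P_{\theta\theta}(t|t-1) \end{bmatrix} \\
R_{e_\theta e_\theta}(t) &:= C[\hat{X}(t|t-1)]\,P(t|t-1)\,C[\hat{X}(t|t-1)]' + R_{vv}(t)
\end{aligned}
$$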


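With this in mind it is possible to derive the internal structure of the SPBP in a manner
similar to that of the XBP, but we will not pursue that derivation here. We just note
that the sigma-points are also augmented to give

$$
\mathcal{X}_i := \begin{bmatrix} \hat{x}(t|t-1) \\ \hat{\theta}(t|t-1) \end{bmatrix}
+ \left( \sqrt{(N_x + \kappa) \begin{bmatrix} P_{xx}(t|t-1) & P_{x\theta}(t|t-1) \\ P_{\theta x}(t|t-1) & P_{\theta\theta}(t|t-1) \end{bmatrix}} \right)_i \tag{8.32}
$$

with the corresponding process noise covariance partitioned as

$$
R_{ww}(t-1) := \begin{bmatrix} R_{w_x w_x}(t-1) & 0 \\ 0 & R_{w_\theta w_\theta}(t-1) \end{bmatrix} \tag{8.33}
$$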
It also follows that the prediction-step becomes

$$
\mathcal{X}_i(t|t-1) = \begin{bmatrix} a[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] + b[\hat{\theta}(t|t-1), u(t-1)] \\ a[\hat{\theta}(t|t-1)] \end{bmatrix} \tag{8.34}
$$

and in the multichannel (vector) measurement case we have that

$$
\mathcal{Y}_i(t|t-1) = \begin{bmatrix} c[\hat{x}(t|t-1), \hat{\theta}(t|t-1)] \\ c[\hat{\theta}(t|t-1)] \end{bmatrix} \tag{8.35}
$$

Using the augmented state vector, we apply the “joint” approach to the trajectory 
estimation problem [20] and compare its performance to that of the XBP. 

Example 8.2 

Using the discrete nonlinear trajectory estimation problem of the previous example
with unknown coefficients as before, we define the augmented sigma-point vector
$\mathcal{X}(t)$ defined above and apply the SPBP algorithm of Table 6.1.

The process equations for this problem are:

$$
x(t) = (1 - \theta_1 \Delta T)\,x(t-1) + \theta_2\,x^2(t-1) + w(t-1)
$$

with the identical measurement and covariances as before. The true parameters are:
$\Theta_{true} = [0.05 \;\; 0.04]'$. The SPBP can be applied to this problem by defining the
parameter vector as a constant and augmenting it to form the new sigma-point vector
$\mathcal{X} = [x' \;\; \theta_1 \;\; \theta_2]'$. Therefore the process model becomes

$$
\mathcal{X}(t) = \begin{bmatrix} (1 - \theta_1(t-1)\Delta T)\,x(t-1) + \theta_2(t-1)\Delta T\,x^2(t-1) \\ \theta_1(t-1) \\ \theta_2(t-1) \end{bmatrix} + w(t-1)
$$

$$
y(t) = x^2(t) + x^3(t) + v(t)
$$

To implement the SPBP the sigma-points are selected as before for a Gaussian
distribution using the scaled transformation with $\alpha = 1$, $\kappa = 0$ and $\beta = 2$, using the same
initial conditions as the XBP.

Using MATLAB [18], the ABSP is applied to solve this problem for 1500 samples
with $\Delta T = 0.01$ sec. The results of the SPBP run are shown in Fig. 8.3. We
see the estimated state and parameter estimates in a and b. After a short transient,
the state estimate begins tracking the true state as evidenced by the predicted
measurement and innovations sequence in Fig. 8.3c. The parameter estimates converge
to their true values as evidenced by the plots. The final estimates are: $\hat{\theta}_1 = 0.05$;
$\hat{\theta}_2 = 0.04$. The processor appears to converge much faster than the XBP, demonstrating
its improved capability. This is implied by the zero-mean/whiteness tests shown in c.
The innovations are statistically white (1.86% out) and zero-mean (0.0113 < 0.0612).
The filtered measurement is also shown in c. This completes the ABSP
example. △△△

FIGURE 8.3 SPBP (UKF) simulation. (a) Estimated state and parameter no. 1.
(b) Estimated parameter no. 2 and innovation. (c) Predicted measurement and
zero-mean/whiteness test (0.0113 < 0.0612 and 1.86% out).
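
The sigma-point prediction step for the augmented state can be sketched as follows. This assumes NumPy and the commonly used scaled unscented transformation (with $\lambda = \alpha^2(N+\kappa) - N$ and a Cholesky factor as the matrix square root); it is an illustration rather than the Table 6.1 algorithm verbatim, and the function names are hypothetical. With $\alpha = 1$, $\kappa = 0$, $\beta = 2$ and $N = 3$, the scaling gives $\lambda = 0$, so the central point carries zero mean weight.

```python
import numpy as np

DT = 0.01  # sampling interval (sec), from the example

def a_aug(X):
    """Augmented process model: X = [x, theta1, theta2] (parameters held constant)."""
    x, th1, th2 = X
    return np.array([(1.0 - th1 * DT) * x + th2 * DT * x**2, th1, th2])

def sigma_points(X_hat, P, alpha=1.0, kappa=0.0, beta=2.0):
    """Scaled sigma-points (rows) with mean and covariance weights."""
    n = X_hat.size
    lam = alpha**2 * (n + kappa) - n
    S = np.linalg.cholesky((n + lam) * P)          # S @ S.T = (n + lam) P
    pts = np.vstack([X_hat, X_hat + S.T, X_hat - S.T])
    w_m = np.full(2 * n + 1, 1.0 / (2.0 * (n + lam)))
    w_c = w_m.copy()
    w_m[0] = lam / (n + lam)
    w_c[0] = lam / (n + lam) + (1.0 - alpha**2 + beta)
    return pts, w_m, w_c

def unscented_transform(pts, w_m, w_c, f):
    """Propagate sigma-points through f and recombine into a mean and covariance."""
    Y = np.array([np.atleast_1d(f(p)) for p in pts])
    mean = w_m @ Y
    d = Y - mean
    cov = (w_c[:, None] * d).T @ d
    return mean, cov

def spbp_predict(X_hat, P, R_ww):
    """Prediction step for the augmented state [x, theta1, theta2]."""
    pts, w_m, w_c = sigma_points(X_hat, P)
    X_pred, P_pred = unscented_transform(pts, w_m, w_c, a_aug)
    return X_pred, P_pred + R_ww
```

The same `unscented_transform` can be reused with the measurement function to obtain the predicted measurement and innovations covariance; no Jacobians are required, which is the main practical difference from the extended processor.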

We also note in closing that a "dual" rather than "joint" approach has evolved
in the literature. Originally developed as a bootstrap approach, it is constructed from
two individual (decoupled) processors: one for the state estimator and one for the
parameter estimator, which pass updated estimates back and forth to one another as
they become available. This is a suboptimal methodology, but appears to perform
quite well (see [22, 23] for more details). This completes our discussion of the joint
modern approach. Next we investigate the SMC approach to solving the joint problem.

8.4 PARTICLE-BASED JOINT BAYESIAN STATE/PARAMETRIC 
PROCESSORS 

In this section we briefly develop the sequential Monte Carlo approach to solving
the joint state/parametric processing problem. It is not surprising that the resulting
particle filtering technique does not perform very well, especially for a non-dynamic
or static parameter. In fact, after the initial time, if left alone, the PF will just assign
the initial weight as unity and proceed to use the initial parameter estimate for the
entire trajectory. As discussed previously for the state estimation problem, this occurs
because the parameter has no mechanism to explore the associated parameter space
for the optimal solution.³ The most obvious solution to this particular problem is
to artificially assign a random walk (pseudo-dynamic) model with small variance to
approximate small variations in the parameter, forcing it to vary, that is,

$$
\theta(t) = \theta(t-1) + w_\theta(t-1) \quad \text{for } w_\theta \sim \mathcal{N}(0, R_{w_\theta w_\theta}) \tag{8.36}
$$

The process noise variance bounds the excursions of the random walk or, equivalently,
the parameter space. Using artificial dynamics is the identical approach used for
both the classical (XBP) and modern (SPBP) techniques of the previous section.
The only problem arises when the parameter is truly static, such as in financial models,
where its variations can have very large repercussions in the money and economic
markets. Thus, a large variety of "off-line" MC methods have evolved [16], but we
will not discuss them here since we are primarily interested in physical systems, which
typically have parametric uncertainties that are well modeled by the random walk or
other variations. Of course, if the parametric relations are truly dynamic, then the joint
approach incorporates parameter estimation and yields an optimal filtering solution
to this joint problem.

Here we are concerned with the joint estimation problem consisting of setting 
a prior for 6 and augmenting the state vector to solve the joint estimation problem 
as defined in Sec. 8.2 thereby converting the parameter estimation problem to one 
of optimal filtering. Thus, confining our discussion to state-space models and the 
Bayesian equivalence we develop the Bayesian approach using the relations: 

$$
\begin{aligned}
x &\sim \mathcal{A}_x(x(t)|x(t-1), \theta(t-1)) \\
\theta &\sim \mathcal{A}_\theta(\theta(t)|\theta(t-1), x(t-1)) \\
y &\sim \mathcal{C}(y(t)|x(t), \theta(t))
\end{aligned} \tag{8.37}
$$

Here we separate the state transition function into the individual vectors for illustrative
purposes, but in reality (as we shall observe), they can be jointly coupled. The key
idea is to develop the PF technique to estimate the joint posterior $\Pr(x(t), \theta(t)|Y_t)$
relying on the parametric posterior $\Pr(\theta(t)|Y_t)$ in the Bayesian decomposition. We will
follow the approach outlined in [16, 33], starting with the full posterior distributions
and proceeding to the filtering distributions.

Suppose it is possible to sample $N_p$-particles, $\{X_t(i), \Theta_t(i)\}$ for $i = 1, \ldots, N_p$,
from the joint posterior distribution where we define $X_t := \{x(0), \ldots, x(t)\}$ and
$\Theta_t := \{\theta(0), \ldots, \theta(t)\}$. Then the corresponding empirical approximation of the joint
posterior is given by


³ Applying a particle filter to a problem that has little or no process noise is not
practical; it is better to use other methods for this type of problem [6].
$$
\hat{\Pr}(X_t, \Theta_t | Y_t) \approx \frac{1}{N_p} \sum_{i=1}^{N_p} \delta\big(X_t - X_t(i),\, \Theta_t - \Theta_t(i)\big) \tag{8.38}
$$

and it follows directly that the filtering posterior (see Chapter 2) is given by

$$
\hat{\Pr}(x(t), \theta(t) | Y_t) \approx \frac{1}{N_p} \sum_{i=1}^{N_p} \delta\big(x(t) - x_i(t),\, \theta(t) - \theta_i(t)\big) \tag{8.39}
$$

Unfortunately it is not possible to sample directly from the full joint posterior
$\Pr(X_t, \Theta_t | Y_t)$ at any time $t$. However, one approach to mitigate this problem is
to use the importance sampling approach of Chapter 2.

Suppose we define a (full) importance distribution, $q(X_t, \Theta_t | Y_t)$, such that
$\Pr(X_t, \Theta_t | Y_t) > 0$ implies $q(\cdot) > 0$; then we define the corresponding importance
weight (as before) by

$$
W(X_t, \Theta_t) \propto \frac{\Pr(X_t, \Theta_t | Y_t)}{q(X_t, \Theta_t | Y_t)} \tag{8.40}
$$

From Bayes' rule we have that the posterior can be expressed as

$$
\Pr(X_t, \Theta_t | Y_t) = \frac{\Pr(Y_t | X_t, \Theta_t) \times \Pr(X_t, \Theta_t)}{\Pr(Y_t)} \tag{8.41}
$$

Thus, if $N_p$-particles, $\{X_t(i), \Theta_t(i)\}$; $i = 1, \ldots, N_p$, can be generated from the
importance distribution

$$
\{X_t(i), \Theta_t(i)\} \rightarrow q(X_t, \Theta_t | Y_t) \tag{8.42}
$$

then the empirical distribution can be estimated and the resulting normalized weights
specified by

$$
\mathcal{W}_t(i) = \frac{W(X_t(i), \Theta_t(i))}{\sum_{k=1}^{N_p} W(X_t(k), \Theta_t(k))} \quad \text{for } i = 1, \ldots, N_p \tag{8.43}
$$

to give the desired empirical distribution of Eq. 8.38 leading to the corresponding
filtering distribution of Eq. 8.39.
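
The weight normalization of Eq. 8.43 is usually carried out on log-weights to avoid underflow when likelihoods are very small. A minimal sketch, assuming NumPy; the function names and the callables `log_target` and `log_proposal` (unnormalized log-densities evaluated on an array of samples) are illustrative, not from the text:

```python
import numpy as np

def normalize_log_weights(log_w):
    """Normalized weights W_i / sum_k W_k computed stably from log-weights."""
    log_w = np.asarray(log_w, dtype=float)
    w = np.exp(log_w - np.max(log_w))   # shift so the largest weight is exp(0) = 1
    return w / w.sum()

def importance_estimate(samples, log_target, log_proposal):
    """Self-normalized importance sampling estimate of the posterior mean."""
    log_w = log_target(samples) - log_proposal(samples)   # log of the target/proposal ratio
    W = normalize_log_weights(log_w)
    return W, W @ samples
```

Subtracting the maximum log-weight before exponentiating leaves the normalized weights unchanged, since the common factor cancels in the ratio.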

If we have a state transition model available, then for a fixed parameter estimate
the state estimation problem is easily solved as before in Chapter 7. Therefore, we
will confine our discussion to the parameter posterior distribution estimation problem,
that is, marginalizing the joint distribution with respect to the states (that have already
been estimated) gives

$$
\Pr(\Theta_t | Y_t) = \int \Pr(X_t, \Theta_t | Y_t)\, dX_t \tag{8.44}
$$
and it follows that

$$
W(\Theta_t) \propto \frac{\Pr(\Theta_t | Y_t)}{q(\Theta_t | Y_t)} \tag{8.45}
$$

Assuming that a set of particles can be generated from the importance distribution
as $\Theta_t \sim q(\Theta_t | Y_t)$, then we have the set of normalized weights

$$
\mathcal{W}_t(\Theta(i)) = \frac{W(\Theta_t(i))}{\sum_{k=1}^{N_p} W(\Theta_t(k))} \quad \text{for } i = 1, \ldots, N_p \tag{8.46}
$$

which is the joint "batch" importance sampling solution when coupled with the
dynamic state vectors.

The sequential form of this joint formulation follows directly (as before in Chapter 2).
We start with the factored form of the importance distribution and focus on the
full posterior $\Pr(\Theta_t | Y_t)$, that is,

$$
q(\Theta_t | Y_t) = \prod_{k=0}^{t} q(\theta(k) | \Theta_{k-1}, Y_k) \tag{8.47}
$$

with $\Pr(\theta(0) | \Theta_{-1}, Y_t) \rightarrow \Pr(\theta(0) | Y_t)$.

Assuming that this factorization can be expressed recursively in terms of the
previous step, $q(\Theta_{t-1} | Y_{t-1})$, and extracting the $t$-th term gives

$$
q(\Theta_t | Y_t) = q(\theta(t) | \Theta_{t-1}, Y_t) \times \prod_{k=0}^{t-1} q(\theta(k) | \Theta_{k-1}, Y_k) \tag{8.48}
$$

or simply

$$
q(\Theta_t | Y_t) = q(\theta(t) | \Theta_{t-1}, Y_t) \times q(\Theta_{t-1} | Y_{t-1}) \tag{8.49}
$$

With this in mind the weight recursion becomes

$$
W(\Theta_t) = W(t) \times W(\Theta_{t-1}) \tag{8.50}
$$

Applying Bayes' rule to the posterior, we define

$$
W(t) := \frac{\Pr(y(t) | \Theta_t, Y_{t-1}) \times \Pr(\theta(t) | \theta(t-1))}{\Pr(y(t) | Y_{t-1}) \times q(\theta(t) | \Theta_{t-1}, Y_t)}
\propto \frac{\Pr(y(t) | \Theta_t, Y_{t-1}) \times \Pr(\theta(t) | \theta(t-1))}{q(\theta(t) | \Theta_{t-1}, Y_t)} \tag{8.51}
$$

As before in the state estimation problem we must choose an importance
distribution before we can construct the algorithm. We can choose
the minimum variance (optimal) distribution using local linearization techniques given by
$q(\theta(t) | \Theta_{t-1}, Y_t) \rightarrow \Pr(\theta(t) | \Theta_{t-1}, Y_t)$, which leads to the transition posterior

$$
\Pr(\theta(t) | \Theta_{t-1}, Y_t) = \frac{\Pr(y(t) | \Theta_t, Y_{t-1}) \times \Pr(\theta(t) | \theta(t-1))}{\Pr(y(t) | \Theta_{t-1}, Y_{t-1})} \tag{8.52}
$$

and therefore it follows using Eq. 8.51 that

$$
W_{MV}(t) \propto \Pr(y(t) | \Theta_{t-1}, Y_{t-1}) = \int \Pr(y(t) | \Theta_t, Y_{t-1}) \times \Pr(\theta(t) | \theta(t-1))\, d\theta(t) \tag{8.53}
$$

with (local linearization implementation of Chapter 7)

$$
\Pr(y(t) | \Theta_t, Y_{t-1}) \sim \mathcal{N}(\hat{y}_\theta(t|t-1), R_{e_\theta e_\theta})
$$

which can be obtained from any of the classical or modern techniques (Chapters 5
and 6).

The usual bootstrap approach can also be selected as the importance distribution,
leading to a simpler alternative with $q(\theta(t) | \Theta_{t-1}, Y_t) \rightarrow \Pr(\theta(t) | \theta(t-1))$, and the
weight of Eq. 8.51 becomes

$$
W_{BS}(t) = \Pr(y(t) | \Theta_t, Y_{t-1}) \tag{8.54}
$$
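
For the bootstrap choice, the recursion of Eq. 8.50 reduces to multiplying the previous weight by the likelihood of Eq. 8.54 after drawing from the random walk of Eq. 8.36. A minimal sketch in log-weight form, assuming NumPy and a scalar parameter per particle; `log_lik` (the log-likelihood as a function of the measurement and the parameter particles) and the function names are hypothetical placeholders supplied by the user:

```python
import numpy as np

def bootstrap_weight_update(theta, log_w, y, log_lik, rw_std, rng):
    """One bootstrap step: random-walk proposal, then W(t) x W(Theta_{t-1}) in log form."""
    # Draw from the transition prior, theta(t) = theta(t-1) + w_theta(t-1)
    theta_new = theta + rw_std * rng.standard_normal(theta.shape)
    # Bootstrap weight is the likelihood; accumulate it onto the previous log-weight
    log_w_new = log_w + log_lik(y, theta_new)
    # Normalize so that the weights sum to one
    log_w_new = log_w_new - np.max(log_w_new)
    log_w_new = log_w_new - np.log(np.sum(np.exp(log_w_new)))
    return theta_new, log_w_new

def effective_sample_size(log_w):
    """N_eff = 1 / sum_i W_i^2 for normalized weights given as logs."""
    w = np.exp(log_w)
    return 1.0 / np.sum(w**2)
```

A small effective sample size signals that only a few particles carry the weight, which is the usual trigger for resampling.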

From the pragmatic perspective, we must consider some practical approaches to
implementing the processor for the joint problem. The first approach, when applicable
(not for financial problems), is to incorporate the random walk model of Eq. 8.36 when
reasonable [16]. We will use this approach for our case study to follow. Another
variant is to use the "roughening" method that moves the particles (after resampling)
by adding a Gaussian sequence of specified variance.⁴
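
Roughening can be sketched as a jitter applied after resampling. The version below assumes NumPy and the commonly used rule in which the jitter standard deviation for each component is $\kappa M N_p^{-1/N_x}$, with $M$ taken here as the spread (maximum minus minimum) of that component over the particle set; the exact definition used in Sec. 7.5 may differ, so this is an assumption, and the function name is illustrative.

```python
import numpy as np

def roughen(particles, k, rng):
    """Add zero-mean Gaussian jitter scaled by the particle spread (after resampling)."""
    n_p, n_x = particles.shape
    spread = particles.max(axis=0) - particles.min(axis=0)   # per-component range M
    sigma = k * spread * n_p ** (-1.0 / n_x)                  # jitter standard deviation
    return particles + sigma * rng.standard_normal(particles.shape)
```

Note that with this spread-based rule a completely collapsed particle set (all particles identical) has zero spread and is left unchanged, so roughening has to be applied before the parameter particles degenerate entirely.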

The kernel method (regularized PF of Chapter 7) can also be applied to the joint
problem. In the bootstrap implementation we can estimate the posterior distribution
using the kernel density approach, that is, after resampling we have the empirical
distribution given by

$$
\hat{\Pr}(\theta(t) | Y_t) \approx \frac{1}{N_p} \sum_{i=1}^{N_p} \delta(\theta(t) - \hat{\theta}_i(t)) \tag{8.55}
$$

The kernel technique consists of substituting for this degenerate distribution the
kernel density estimate

$$
\hat{\Pr}(\theta(t) | Y_t) \approx \frac{1}{N_p} \sum_{i=1}^{N_p} \mathcal{K}(\theta(t) - \hat{\theta}_i(t)) \tag{8.56}
$$


⁴ Recall that the sequence is distributed $\epsilon \sim \mathcal{N}(0, \kappa M_{ij} N_p^{-1/N_x})$ for $\kappa$ a constant and $M_{ij}$ the maximum
distance between the $i$-th and $j$-th particles, discussed previously in Sec. 7.5.
for $\mathcal{K}(\cdot)$ the kernel (e.g., Gaussian, triangle, etc.). Now a new set of parametric particles
can be obtained by generating samples from $\hat{\theta}_i(t) \sim \mathcal{K}(\theta(t))$. In the same manner as
the standard bootstrap, this approach introduces diversity into the set of particles.
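
Sampling from the kernel density estimate of Eq. 8.56 amounts to picking a particle according to its weight and then perturbing it with a draw from the kernel. A minimal sketch, assuming NumPy and a Gaussian kernel; the `bandwidth` argument is a hypothetical tuning parameter here (Chapter 7 discusses how a bandwidth may be chosen), and the function name is illustrative.

```python
import numpy as np

def regularized_resample(particles, weights, bandwidth, rng):
    """Resample by weight, then add Gaussian-kernel noise to each selected particle."""
    n_p = particles.shape[0]
    idx = rng.choice(n_p, size=n_p, p=weights)     # multinomial resampling
    selected = particles[idx]
    return selected + bandwidth * rng.standard_normal(selected.shape)
```

With `bandwidth = 0` this reduces to ordinary multinomial resampling, where duplicated particles stay identical; a positive bandwidth spreads the duplicates apart.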

Yet another alternative is to introduce the MCMC-step (see Chapter 7) to "move"
the particles to the regions of high probability using a Markov chain with appropriate
invariant distribution. Again the new particles are sampled according to a MCMC
with joint distribution (when possible) $\Pr(X_t(i), \Theta_t(i) | Y_t)$ such that

$$
(X_t(i), \Theta_t(i)) \sim \mathcal{K}_{MCMC}(X_t, \Theta_t | Y_t)
$$

This completes the discussion of the joint state/parameter estimation problem using
the PF approach. We emphasize that the algorithm of Chapter 7 may be used by merely
augmenting the state vector with the parameter vector, especially when a dynamic
equation characterizing the parameter dynamics is available. Next we consider an
example to illustrate this approach.

Example 8.3 

Again we consider the trajectory estimation problem [20] using the particle filter
technique. At first we applied the usual bootstrap technique and found what was
expected: a collapse of the parameter particles giving an unsatisfactory result. Next
we tried the "roughening" approach and the results were much more reasonable. We
used a roughening factor of $\kappa = 5 \times 10^{-5}$ along with $N_p = 350$. The results are shown
in Fig. 8.4. We see both the estimated states and measurement along with the
associated zero-mean/whiteness test. The result, although not as good as the modern
approach, is reasonable, with the final estimates (in parentheses) converging toward the
true static parameter values: $\theta_1 = 0.05$ (0.034) and $\theta_2 = 0.04$ (0.039). The state and
measurement estimates are quite good as evidenced by the zero-mean (0.002 < 0.061)
and whiteness (1.76% out) tests. We show the estimated posterior distributions for the
states and parameters in Fig. 8.5, again demonstrating a reasonable solution. Note
how the distributions are initially multi-modal and become unimodal as the parameter
estimates converge to their true values as depicted in Fig. 8.5. △△△

8.5 CASE STUDY: RANDOM TARGET TRACKING USING A SYNTHETIC 
APERTURE TOWED ARRAY 

Synthetic aperture processing is well-known in airborne radar, but not as familiar
in sonar [35-40]. The underlying idea in creating a synthetic aperture is to increase
the array length by motion, thereby increasing spatial resolution (bearing) and gain
in SNR. It has been shown that for stationary targets the motion-induced bearing
estimates have smaller variance than that of a stationary array [38, 41]. Here we
investigate the case of both array and target motion. We define the acoustic array
space-time processing problem as:

GIVEN a set of noisy pressure-field measurements from a horizontally towed array 
of L-sensors in motion, FIND the “best” (minimum error variance) estimate of the 
target bearings. 
FIGURE 8.4 PF simulation. (a) Estimated state and parameter no. 1. (b) Estimated
parameter no. 2 and innovation. (c) Predicted measurement and zero-mean/whiteness
test (0.002 < 0.061 and 1.76% out).


We use the following nonlinear pressure-field measurement model for $M$
monochromatic plane wave targets characterized by a corresponding set of temporal
frequencies, bearings, and amplitudes, $[\{\omega_m\}, \{\theta_m\}, \{a_m\}]$. That is,

$$
p(x, t_k) = \sum_{m=1}^{M} a_m e^{j(\omega_m t_k - \beta(x, t_k) \sin\theta_m)} + n(t_k) \tag{8.57}
$$

where $\beta(x, t_k) := k_0 (x(t_0) + v t_k)$, $k_0 = 2\pi/\lambda$ is the wavenumber, $x(t_k)$ is the current
spatial position along the $x$-axis in meters, $v$ is the tow speed in m/sec, and $n(t_k)$
is the additive random noise. The inclusion of motion in the generalized
wavenumber, $\beta$, is critical to the improvement of the processing, since the synthetic
aperture effect is actually created through the motion itself and not simply the
displacement.
FIGURE 8.5 PF posterior distribution estimation. (a) Estimated state posterior. (b) Parameter no. 1
posterior. (c) Parameter no. 2 posterior.
If we further assume that the single sensor equation above is expanded to include
an array of $L$-sensors, $x \rightarrow x_\ell$, $\ell = 1, \ldots, L$; then we obtain

$$
\begin{aligned}
p(x_\ell, t_k) &= \sum_{m=1}^{M} a_m e^{j(\omega_m t_k - \beta(x_\ell, t_k) \sin\theta_m)} + n_\ell(t_k) \\
&\rightarrow \sum_{m=1}^{M} a_m \cos(\omega_m t_k - \beta(x_\ell, t_k) \sin\theta_m) + n_\ell(t_k)
\end{aligned} \tag{8.58}
$$

Since our hydrophone sensors measure the real part of the complex pressure-field, the
final nonlinear measurement model of the system can be written in compact vector
form as

$$
p(t_k) = c[t_k; \Theta] + n(t_k) \tag{8.59}
$$

where $p, c, n \in C^{L \times 1}$ are the respective pressure-field, measurement and noise vectors
and $\Theta \in R^{M \times 1}$ represents the target bearings. The corresponding vector measurement
model is

$$
c_\ell(t_k; \Theta) = \sum_{m=1}^{M} a_m \cos(\omega_m t_k - \beta(x_\ell, t_k) \sin\theta_m) \quad \text{for } \ell = 1, \ldots, L
$$

Since we model the bearings as a random walk emulating random target motion, then
the Markovian state-space model evolves from first differences as

$$
\Theta(t_k) = \Theta(t_{k-1}) + w_\theta(t_{k-1}) \tag{8.60}
$$

Thus, the state-space model is linear with no explicit dynamics; therefore, the process
matrix $A = I$ (identity) and the relations are greatly simplified.
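
The array measurement function $c_\ell(t_k; \Theta)$ and the random-walk bearing model can be written compactly. The sketch below assumes NumPy and takes the generalized wavenumber as $\beta(x_\ell, t) = k_0 (x_\ell + v t)$ with $k_0 = 2\pi/\lambda$; the function names and argument layout are illustrative, and bearings are in radians.

```python
import numpy as np

def generalized_wavenumber(x_sensors, t, wavelength, tow_speed):
    """beta(x_l, t) = k_0 (x_l + v t) for each sensor position x_l (meters)."""
    k0 = 2.0 * np.pi / wavelength
    return k0 * (x_sensors + tow_speed * t)

def array_measurement(t, bearings, amps, omegas, x_sensors, wavelength, tow_speed):
    """c_l(t; Theta) = sum_m a_m cos(omega_m t - beta(x_l, t) sin(theta_m)), l = 1..L."""
    beta = generalized_wavenumber(x_sensors, t, wavelength, tow_speed)[:, None]  # (L, 1)
    phase = omegas[None, :] * t - beta * np.sin(bearings)[None, :]               # (L, M)
    return np.sum(amps[None, :] * np.cos(phase), axis=1)                         # (L,)

def bearing_random_walk(bearings, rw_std, rng):
    """Theta(t_k) = Theta(t_{k-1}) + w_theta(t_{k-1}) with Gaussian increments."""
    return bearings + rw_std * rng.standard_normal(bearings.shape)
```

For instance, a single unit-amplitude target at broadside (bearing zero) produces the same value $\cos(\omega t)$ at every sensor, since the spatial phase term vanishes.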

Now let us see how a particle filter using the bootstrap approach can be
constructed according to the generic algorithm of Table 7.2. For this problem, we assume
the additive noise sources are Gaussian, so we can compare results to the
performance of the approximate processor. We define the discrete notation, $t_{k+1} \rightarrow t+1$,
for the sampled-data representation.

Let us cast this problem into the sequential Bayesian framework, that is, we would
like to estimate the instantaneous posterior filtering distribution, $\hat{\Pr}(x(t) | Y_t)$, using
the PF representation to be able to perform inferences and extract the target bearing
estimates. Therefore, we have that the transition probability is given by ($\Theta(t) \rightarrow x(t)$)

$$
\Pr(\theta(t) | \theta(t-1)) \rightarrow \mathcal{A}(\theta(t) | \theta(t-1)) \sim \mathcal{N}(\Theta(t) : a[\Theta(t-1)], R_{w_\theta w_\theta})
$$

or in terms of our state transition (bearings) model, we have

$$
\Theta(t) = a[\Theta(t-1)] + w_\theta(t-1) = \Theta(t-1) + w_\theta(t-1) \quad \text{for } \Pr(w_\theta(t)) \sim \mathcal{N}(0, R_{w_\theta w_\theta})
$$
The corresponding likelihood is specified in terms of the measurement model
($y(t) \rightarrow p(t)$) as

$$
\Pr(y(t) | x(t)) \rightarrow \mathcal{C}(y(t) | x(t)) \sim \mathcal{N}(y(t) : c[\Theta(t)], R_{vv}(t))
$$

where we have used the notation $z \sim \mathcal{N}(z : m_z, R_{zz})$ to specify the Gaussian
distribution in random vector $z$. In terms of our problem, we have that

$$
\ln \mathcal{C}(y(t) | x(t)) = \kappa - \frac{1}{2} \Big( y(t) - \sum_{m=1}^{M} a_m \cos(\omega_m t - \beta(t) \sin\theta_m) \Big)' R_{vv}^{-1}
\Big( y(t) - \sum_{m=1}^{M} a_m \cos(\omega_m t - \beta(t) \sin\theta_m) \Big)
$$

with $\kappa$ a constant, $\beta \in R^{L \times 1}$ and $\beta(t) := [\beta(x_1, t) \,|\, \cdots \,|\, \beta(x_L, t)]'$, the dynamic
wavenumber expanded over the array. Thus, the SIR algorithm becomes:

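• Draw samples (particles) from the state transition distribution:
$\Theta_i(t) \sim \mathcal{N}(\Theta(t) : a[\Theta(t-1)], R_{w_\theta w_\theta})$, that is,
$w_{\theta_i}(t) \sim \Pr(w(t)) \sim \mathcal{N}(0, R_{w_{\theta_i} w_{\theta_i}})$, $\Theta_i(t) = \Theta_i(t-1) + w_{\theta_i}(t-1)$

• Estimate the likelihood, $\mathcal{C}(y(t) | \Theta(t)) \sim \mathcal{N}(y(t) : c[\Theta(t)], R_{vv}(t))$ with
$c_\ell(t; \Theta_i) = \sum_{m=1}^{M} a_m \cos(\omega_m t_k - \beta(x_\ell, t) \sin\Theta_{m,i}(t))$ for $\ell = 1, \ldots, L$, where $\Theta_{m,i}$ is the $i$-th
particle at the $m$-th bearing angle;

• Update and normalize the weight: $\mathcal{W}_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$

• Resample: $\hat{N}_{eff}(t) \le N_{thresh}$

• Estimate the instantaneous posterior: $\hat{\Pr}(\Theta(t) | Y_t) \approx \sum_{i=1}^{N_p} \mathcal{W}_i(t)\, \delta(\Theta(t) - \Theta_i(t))$

• Perform the inference by estimating the corresponding statistics:
$\hat{\Theta}_{MAP}(t) = \arg\max \hat{\Pr}(\Theta(t) | Y_t)$; $\hat{\Theta}_{MMSE}(t) = E\{\Theta(t) | Y_t\} = \sum_{i=1}^{N_p} \Theta_i(t)\, \hat{\Pr}(\Theta(t) | Y_t)$;
$\hat{\Theta}_{MEDIAN}(t) = \mathrm{median}(\hat{\Pr}(\Theta(t) | Y_t))$.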
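
The steps listed above can be sketched as a single function. This is a minimal illustration assuming NumPy, a scalar random-walk standard deviation `rw_std`, a scalar measurement noise variance `r_vv` (that is, $R_{vv} = r_{vv} I$), and a user-supplied measurement function `h(t, theta)` returning the $L$ predicted sensor outputs; these names are hypothetical. The MAP estimate is approximated here by the particle with the largest weight, and resampling is plain multinomial resampling, which may differ from the scheme used in Table 7.2.

```python
import numpy as np

def sir_step(particles, weights, y, t, h, rw_std, r_vv, rng, n_thresh):
    """One SIR cycle: draw, weight, infer, and resample if N_eff <= n_thresh."""
    n_p = particles.shape[0]
    # Draw from the random-walk transition distribution
    particles = particles + rw_std * rng.standard_normal(particles.shape)
    # Gaussian log-likelihood of the measurement for every particle
    y_pred = np.array([h(t, p) for p in particles])            # (N_p, L)
    log_lik = -0.5 * np.sum((y - y_pred) ** 2, axis=1) / r_vv
    # Update and normalize the weights (in log form for numerical stability)
    with np.errstate(divide="ignore"):
        log_w = np.log(weights) + log_lik
    w = np.exp(log_w - np.max(log_w))
    w = w / w.sum()
    # Inference from the weighted particle set
    mmse = w @ particles                        # conditional mean
    map_est = particles[np.argmax(w)].copy()    # highest-weight particle
    # Resample when the effective sample size falls to the threshold
    n_eff = 1.0 / np.sum(w ** 2)
    if n_eff <= n_thresh:
        idx = rng.choice(n_p, size=n_p, p=w)
        particles = particles[idx]
        w = np.full(n_p, 1.0 / n_p)
    return particles, w, mmse, map_est, n_eff
```

The estimates are computed before resampling so that they use the full weighted set; after resampling the weights are reset to $1/N_p$.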

Consider the following simulation of the synthetic aperture using a 4-element,
linear towed array with "moving" targets using the following parameters:

Target: unity amplitudes with temporal frequency of 50 Hz, wavelength = 30 m,
tow speed = 5 m/sec; Array: four (4) element linear towed array with 15 m spacing;
Particle filter: $N_\theta = 4$ states (bearings), $N_y = 4$ sensors, $N = 250$ samples,
$N_p = 250$ weights; SNR: $-10$ dB; Noise: white Gaussian with $R_{ww} = \mathrm{diag}[2.5]$,
$R_{vv} = \mathrm{diag}[0.1414]$; Sampling interval: 0.005 sec; Initial conditions (bearings
and uncertainty): $\Theta_0 = [45° \;\; -10° \;\; 5° \;\; -75°]'$, $P_0 = \mathrm{diag}(10^{-10})$.

The array simulation was executed and the targets moved according to a random
walk specified by the process noise and sensor array measurements with $-10$ dB SNR.
The results are shown in Fig. 8.6 where we see the noisy synthesized bearings (left)
and four (4) noisy sensor measurements at the moving array. The bearing (state)
estimates are shown in Fig. 8.7 where we observe the targets making a variety of
course alterations. The PF (MAP) is able to track the target motions quite well while
we observe the unscented Kalman filter (UKF) [18] unable to respond quickly enough
and finally losing track completely for target no. 4. It should be noted that targets no. 2
and no. 4 "crossover" between 0.8 and 1.0 sec. The PF loses these tracks during this
time period, getting them confused, but recovers by the 1 sec time step. Both the MAP
and MMSE (CM) estimates using the estimated posterior provide excellent tracking.
Note that these bearing inputs would provide the raw data for an XY-tracker [18].
The PF estimated or filtered measurements are shown in Fig. 8.8. As expected the
PF tracks the measurement data quite well while the UKF is again in small error.
Using the usual optimality tests for performance demonstrates that the PF processor
works well, since each measurement channel is zero-mean and white with the WSSR
lying below the threshold, indicating a white innovations sequence and demonstrating
the tracking ability of the PF processor at least in a classical sense [18], as shown in
Fig. 8.9. The instantaneous posterior distributions for the bearing estimates are shown
in Fig. 8.10. Here we see the Gaussian nature of the bearing estimates generated by
the random walk. Clearly, the PF performs quite well for this problem. Note also
the capability of using the synthetic aperture, since we have only a 4-element sensor
array, yet we are able to track 4 targets. Linear array theory implies that with a static array
we should only be able to track $L - 1 = 3$ targets!

FIGURE 8.6 Synthetic aperture sonar tracking problem: Simulated target motion from initial
bearings of 45°, -10°, 5° and -75° and array measurements (-10 dB SNR).

FIGURE 8.7 Particle filter bearing estimates for four targets in random motion: PF bearing
(state) estimates and simulated target tracks (UKF conditional mean, MAP).

In this case study we have applied the bootstrap PF to an ocean acoustic synthetic
aperture towed array target tracking problem to test the performance of the particle
filtering technique. The results are quite reasonable on this simulated data set.
FIGURE 8.8 Particle filter predicted measurement estimates for four channel hydrophone
sensor array.

FIGURE 8.9 Particle filter classical performance metrics: zero-mean/whiteness tests for
45°, -10°, 5° and -75° targets as well as the corresponding WSSR test.

8.6 SUMMARY

In this chapter we have discussed the development of joint Bayesian state/parametric
processors. Starting with a brief introduction, we defined the variety of problems
based on the joint posterior distribution $\Pr(x(t), \theta(t)|Y_t)$ and its decomposition. We
decided to focus on the joint problem of estimating both states and parameters
simultaneously (on-line), the most common problem and the one of highest interest. We then briefly
showed that all that is necessary for this problem is to define an “augmented” state
consisting of the original state variables along with the unknown parameters, typically
modeled by a random walk when a dynamic parametric model is not available.
This casts the joint problem into an optimal filtering framework. We then showed
how this augmentation leads to a decomposition of the classical (EKF) processor and
developed the “decomposed” form for illustrative purposes. The algorithm is implemented
by executing the usual processor with the new augmented state vector. We
also extended this approach to both the modern “unscented” and “particle-based” processors,
again only requiring the state augmentation procedure to implement. It was
shown that all of the processors required a random walk parametric model to function,
while the particle filters could be implemented using the “roughening” (particle
random walks) or any of the “move” techniques developed in Chapter 7 to track the
parameters effectively. Besides applying these processors to the usual nonlinear trajectory
estimation problem, we developed a case study for a synthetic aperture towed
array and compared the modern to the particle-based processors.
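The state augmentation summarized above can be illustrated with a minimal Python sketch. It is not the chapter's implementation: the function names, the scalar model $x(t) = \theta x(t-1) + w(t-1)$ and the noise levels are all assumptions chosen only to show how the unknown parameter is appended to the state and propagated by a random walk.

```python
import random

def augmented_transition(x_aug, a, sigma_w, sigma_theta, rng):
    """One step of the augmented model.

    The original state follows the (assumed) transition function a(x, theta),
    while the unknown parameter follows a random walk.
    """
    x, theta = x_aug
    x_next = a(x, theta) + rng.gauss(0.0, sigma_w)        # state equation
    theta_next = theta + rng.gauss(0.0, sigma_theta)      # random-walk parameter model
    return (x_next, theta_next)

def simulate_augmented(x0, theta0, a, sigma_w, sigma_theta, n_steps, seed=0):
    """Simulate the augmented state [x, theta] for n_steps time-steps."""
    rng = random.Random(seed)
    trajectory = [(x0, theta0)]
    for _ in range(n_steps):
        trajectory.append(
            augmented_transition(trajectory[-1], a, sigma_w, sigma_theta, rng)
        )
    return trajectory

def ar1(x, theta):
    """Hypothetical scalar model used only for illustration: a(x, theta) = theta * x."""
    return theta * x

# With both noise levels set to zero the augmented model is deterministic,
# which makes the structure easy to inspect.
noise_free = simulate_augmented(1.0, 0.9, ar1, sigma_w=0.0, sigma_theta=0.0, n_steps=3)
noisy = simulate_augmented(1.0, 0.9, ar1, sigma_w=0.1, sigma_theta=1e-3, n_steps=50)
```

Any of the processors discussed in the chapter would then be run on the augmented pair rather than on the original state alone; the small random-walk variance plays the role of the parametric process noise.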


MATLAB NOTES 

SSPACK_PC is a third-party toolbox in MATLAB that can be used to design
model-based signal processors. This package incorporates the major nonlinear
MBP algorithms discussed in this chapter, all implemented in the UD-factorized
form [18] for stable and efficient calculations. It performs the discrete approximate
Gauss-Markov simulations using (SSNSIM) and both extended (XMBP) and
iterated-extended (IX-MBP) processors using (SSNEST). The linearized model-based
processor (LZ-MBP) is also implemented (SSLZEST). Ensemble operations
are seamlessly embodied within the GUI-driven framework, where it is quite efficient
to perform multiple design runs and compare results. Of course, the heart
of the package is the command or GUI-driven post-processor (SSPOST), which
is used to analyze and display the results of the simulations and processing (see
http://www.techni-soft.net for more details).

REBEL is a recursive Bayesian estimation package in MATLAB, available on the
web, that performs similar operations and includes the newer statistically based unscented
algorithms, among them the UKF and the unscented transformations. It also
includes the new particle filter designs (see http://choosh.ece.ogi.edu/rebel for
more details).






REFERENCES 

1. L. Ljung, System Identification: Theory for the User (Englewood Cliffs, NJ: Prentice-Hall, 
1987). 

2. L. Ljung and T. Soderstrom, Theory and Practice of Recursive Identification (Boston: 
MIT Press, 1983). 

3. T. Soderstrom and P. Stoica, System Identification (Englewood Cliffs, NJ: Prentice-Hall, 
1989). 

4. J. Norton, An Introduction to Identification (New York: Academic Press, 1986). 

5. J. Liu, Monte Carlo Strategies in Scientific Computing (New York: Springer-Verlag,
2001).

6. O. Cappe, E. Moulines and T. Ryden, Inference in Hidden Markov Models (New York:
Springer-Verlag, 2005).

7. J. Liu and M. West, “Combined parameter and state estimation in simulation-based
filtering,” in Sequential Monte Carlo Methods in Practice (A. Doucet, N. de Freitas and
N. Gordon) pp. 197-223 (New York: Springer-Verlag, 2001).

8. A. Doucet, N. de Freitas and N. Gordon, Sequential Monte Carlo Methods in Practice
(New York: Springer-Verlag, 2001).

9. S. Godsill and P. Djuric, “Special Issue: Monte Carlo methods for statistical signal
processing.” IEEE Trans. Signal Proc., 50, 173-499, 2002.

10. O. Cappe, S. Godsill and E. Moulines, “An overview of existing methods and recent 
advances in sequential Monte Carlo,” Proc. IEEE, 95, 5, 899-924, 2007. 

11. G. Kitagawa and W. Gersch, Smoothness Priors Analysis of Time Series (New York: 
Springer-Verlag, 1997). 

12. G. Kitagawa, “Self-organizing state space model,” J. Am. Statistical Assoc., 97, 447, 
1207-1215, 1998. 

13. R. van der Merwe, A. Doucet, N. de Freitas and E. Wan, “The unscented particle filter,”
in Advances in Neural Information Processing Systems 16 (Cambridge, MA: MIT Press,
2000).

14. N. Gordon, D. Salmond and A. Smith, “A novel approach to nonlinear non-gaussian
Bayesian state estimation,” IEE Proc. F., 140, 107-113, 1993.

15. S. Haykin and N. de Freitas, “Special Issue: Sequential state estimation: from Kalman 
filters to particle filters.” Proc. IEEE, 92, 3, 399-574, 2004. 

16. C. Andrieu, A. Doucet, S. Singh, and V. Tadic, “Particle methods for change detection, 
system identification and control,” Proc. IEEE, 92, 6, 423-468, 2004. 

17. S. Haykin, Kalman Filtering and Neural Networks. (New York: John Wiley, 2001). 

18. J. Candy, Model-Based Signal Processing. (Hoboken, NJ: John Wiley/IEEE Press, 2006). 

19. D. Simon, Optimal State Estimation: Kalman and Nonlinear Approaches (Hoboken, 
NJ: John Wiley/IEEE Press, 2006). 

20. A. Jazwinski, Stochastic Processes and Filtering Theory. (New York: Academic Press, 
1970). 

21. L. Ljung, “Asymptotic behavior of the extended Kalman filter as a parameter estimator 
for linear systems,” IEEE Trans. Auto. Control, AC-24, 36-50, 1979. 

22. R. van der Merwe, Sigma-Point Kalman Filters for Probabilistic Inference in Dynamic 
State-Space Models OGI School of Science & Engr., Oregon Health & Science Univ., 
Ph.D. Dissertation, 2004. 





23. A. Nelson, Nonlinear Estimation and Modeling of Noisy Time-Series by Dual Kalman 
Filtering Methods OGI School of Science & Engr., Oregon Health & Science Univ., Ph.D. 
Dissertation, 2000. 

24. J. Candy, “Bootstrap particle filtering,” IEEE Signal Proc. Magz., 24, 4, 73-85, 
2007. 

25. J. Rajan, P. Rayner and S. Godsill, “Bayesian approach to parameter estimation and
interpolation of time-varying autoregressive processes using the Gibbs sampler,” IEE
Proc-Vis. Image Signal Process., 144, 4, 249-256, 1997.

26. N. Polson, J. Stroud and P. Muller, “Practical filtering with sequential parameter learning,”
Univ. Chicago Tech. Rpt., 1-18, 2002.

27. C. Andrieu, A. Doucet, S. Singh and V. Tadic, “Particle methods for change detection, 
system identification and control,” Proc. IEEE, 92, 3, 423-438, 2004. 

28. G. Storvik, “Particle filters in state space models with the presence of unknown static 
parameters,” IEEE Tran. Signal Proc., 50, 2, 281-289, 2002. 

29. P. Djuric, “Sequential estimation of signals under model uncertainty,” in Sequential 
Monte Carlo Methods in Practice (A. Doucet, N. de Freitas and N. Gordon) pp. 381-400 
(New York: Springer-Verlag, 2001). 

30. D. Lee and N. Chia, “A particle algorithm for sequential Bayesian parameter estimation 
and model selection,” IEEE Tran. Signal Proc., 50, 2, 326-336, 2002. 

31. A. Doucet and V. Tadic, “Parameter estimation in general state-space models using particle 
methods,” Ann. Inst. Stat. Math., 55, 2, 409-422, 2003. 

32. C. Andrieu, A. Doucet and V. Tadic, “On-line parameter estimation in general state-space 
models,” Proc. IEEE Conf. Decision and Control, pp. 332-337, 2005. 

33. J. Vermaak, C. Andrieu, A. Doucet and S. Godsill, “Particle methods for Bayesian modeling
and enhancement of speech signals,” IEEE Trans. Speech Audio Proc., 10, 3, 173-185,
2002.

34. T. Schoen and F. Gustafsson, “Particle filters for system identification of state-space
models linear in either parameters or states,” Linkoping University Report, LITH-ISY-R-2518,
2003.

35. R. Williams, “Creating an acoustic synthetic aperture in the ocean,” J. Acoust. Soc. Am.,
60, 60-73, 1976.

36. N. Yen and W. Carey, “Applications of synthetic aperture processing to towed array data,” 
J. Acoust. Soc. Am., 60, 764-775, 1976. 

37. S. Stergiopoulus and E. Sullivan, “Extended towed array processing by an overlap 
correlator,” J. Acoust. Soc. Am., 86, 764-775, 1976. 

38. E. Sullivan, W. Carey and S. Stergiopoulus, “Editorial in special issue on acoustic synthetic 
aperture processing,” IEEE J. Ocean. Eng., 17, 1-7, 1992. 

39. D. Ward, E. Lehmann and R. Williamson, “Particle filtering algorithm for tracking an
acoustic source in a reverberant environment,” IEEE Trans. Speech and Aud. Proc., 11,
6, 826-836, 2003.

40. M. Orton and W. Fitzgerald, “Bayesian approach to tracking multiple targets using sensor 
arrays and particle filters,” IEEE Trans. Signal Proc., 50, 2, 216-223, 2002. 

41. E. Sullivan and J. Candy, “Space-time array processing: The model-based approach,” 
J. Acoust. Soc. Am., 102, 5, 2809-2820, 1997. 





PROBLEMS 

8.1 Suppose we are given the following innovations model (in steady state)

$x(t) = a x(t-1) + k e(t-1)$
$y(t) = c x(t) + e(t)$

where $e(t)$ is the zero-mean, white innovations sequence with covariance, $R_{ee}$.

(a) Derive the Wiener solution using the spectral factorization method of
Sec. 4.5.

(b) Develop the linear steady-state BP for this model.

(c) Develop the parametrically adaptive processor to estimate $k$ and $R_{ee}$.

8.2 As stated in the chapter, the XBP convergence can be improved by incorporating
a gain gradient term in the system Jacobian matrices, that is,

$A^*_\theta[x,\theta] := A_\theta[x,\theta] + [\nabla_\theta K_x(\Theta)]\, e(t)$ for $K := [K_x \,|\, K_\theta]$

(a) By partitioning the original $N_x \times N_\theta$ Jacobian matrix, $A_\theta[x,\theta]$, derive the
general “elemental” recursion, that is, show that

$A^*_\theta[i,\ell] = \nabla_{\theta_\ell} a_i[x,\theta] + \sum_{j=1}^{N_y} \nabla_{\theta_\ell} k_x(i,j)\, e_j(t); \quad i = 1,\ldots,N_x;\ \ell = 1,\ldots,N_\theta$

(b) Suppose we would like to implement this modification; does there exist a
numerical solution that could be used? If so, describe it.

8.3 Using the following scalar Gauss-Markov model

$x(t) = A x(t-1) + w(t-1)$
$y(t) = C x(t) + v(t)$

with the usual zero-mean, $R_{ww}$ and $R_{vv}$ covariances.

(a) Let $[A, C, K, R_{ee}]$ be scalars; develop the ABSP solution to estimate $A$ from
noisy data.

(b) Can these algorithms be combined to “tune” the resulting hybrid
processor?

8.4 Suppose we are given the following structural model

$m\ddot{x}(t) + c\dot{x}(t) + k x(t) = p(t) + w(t)$

$y(t) = \beta x(t) + v(t)$

with the usual zero-mean, $R_{ww}$ and $R_{vv}$ covariances.

(a) Convert this model to discrete-time using first differences. Using central
difference create the discrete Gauss-Markov model. (Hint: $\ddot{x}(t) \approx \frac{x(t) - 2x(t-1) + x(t-2)}{\Delta_t^2}$.)

(b) Suppose we would like to estimate the spring constant $k$ from noisy
displacement measurements; develop the ABSP to solve this problem.

(c) Transform the discrete Gauss-Markov model to the innovations representation.
(Hint: Use the KSP equations of [18]).

(d) Solve the parameter estimation problem using the innovations model, that
is, develop the estimator of the spring constant.


8.5 Given the ARMAX model

$y(t) = -a y(t-1) + b u(t-1) + e(t)$

with innovations covariance, $R_{ee}$:

(a) Write the expressions for the ABSP in terms of the ARMAX model.

(b) Write the expressions for the ABSP in terms of the state-space model.

8.6 Consider tracking a body falling freely through the atmosphere [18]. We
assume it is falling down in a straight line towards a radar. The state vector
is defined by: $x := [z\ \dot{z}\ \beta]$ where $\beta \sim \mathcal{N}(\mu_\beta, R_{\beta\beta}) = \mathcal{N}(2000, 2.5 \times 10^5)$ is
the ballistic coefficient. The dynamics are defined by the state equations

$\dot{x}_1(t) = x_2(t)$

$\dot{x}_2(t) = d - g = \frac{\rho\, x_2^2(t)}{2 x_3(t)} - g$

$\dot{x}_3(t) = 0$

where $d$ is the drag deceleration, $g$ is the acceleration of gravity (32.2), $\rho$ is
the atmospheric density (with $\rho_0$ ($3.4 \times 10^{-3}$) density at sea level) and $k_\rho$ a
decay constant ($2.2 \times 10^4$). The corresponding measurement is given by:

$y(t) = x_1(t) + v(t)$

for $v \sim \mathcal{N}(0, R_{vv}) = \mathcal{N}(0, 100)$. Initial values are $z(0) \sim \mathcal{N}(1065, 500)$,
$\dot{z}(0) \sim \mathcal{N}(-6000, 2 \times 10^4)$ and $P(0) = \mathrm{diag}[p_0(1,1), p_0(2,2), p_0(3,3)] = [500, 2 \times 10^4, 2.5 \times 10^5]$.

(a) Is this an ABSP? If so, write out the explicit algorithm in terms of the
parametrically adaptive algorithm of this chapter.

(b) Develop the XBP for this problem and perform the discrete simulation
using MATLAB.

(c) Develop the LZ-BP for this problem and perform the discrete simulation
using MATLAB.

(d) Develop the PF for this problem and perform the discrete simulation using
MATLAB.

8.7 Parameter estimation can be performed directly when we are given a nonlinear
measurement system such that

$y = h(\theta) + v$

where $y, h \in \mathbb{R}^{N_y \times 1}$ and $\theta \sim \mathcal{N}(m_\theta, R_{\theta\theta})$ and $v \sim \mathcal{N}(0, R_{vv})$.

(a) From the a posteriori density, $\Pr(\theta|y)$, derive the MAP estimator for $\theta$.

(b) Expand $y = h(\theta)$ in a Taylor series about $\theta_0$ and incorporate the first order
approximation into the MAP estimator (approximate).

(c) Expand $y = h(\theta)$ in a Taylor series about $\theta_0$ and incorporate the second
order approximation into the MAP estimator (approximate).

(d) Develop an iterated version of both estimators in (b) and (c). How do
they compare?

(e) Use the parametrically adaptive formulation of this problem assuming the
measurement model is time-varying. Construct the ABSP assuming that
$\theta$ is modeled by a random walk. How does this processor compare to the
iterated versions?

8.8 Suppose we are asked to solve a detection problem, that is, we must “decide”
whether a signal is present or not according to the following binary hypothesis
test

$H_0: y(t) = v(t)$ for $v \sim \mathcal{N}(0, R_{vv})$
$H_1: y(t) = s(t) + v(t)$

The signal is modeled by a Gauss-Markov model

$s(t) = a[s(t-1)] + w(t-1)$ for $w \sim \mathcal{N}(0, R_{ww})$

(a) Calculate the likelihood-ratio defined by

$\mathcal{L}(Y(N)) := \frac{\Pr(Y(N)|H_1)}{\Pr(Y(N)|H_0)}$

where the measurement data set is defined by $Y(N) := \{y(0), y(1), \ldots, y(N)\}$.
Calculate the corresponding threshold and construct the detector (binary
hypothesis test).

(b) Suppose there is an unknown but deterministic parameter in the signal
model, that is,

$s(t) = a[s(t-1), \theta(t-1)] + w(t-1)$

Construct the “composite” likelihood ratio for this case. Calculate the corresponding
threshold and construct the detector (binary hypothesis test).
(Hint: Use the ABSP to jointly estimate the signal and parameter.)

(c) Calculate a sequential form of the likelihood ratio above by letting the
batch of measurements, $N \to t$. Calculate the corresponding threshold
and construct the detector (binary hypothesis test). Note there are two
thresholds for this type of detector.

8.9 Angle modulated communications including both frequency modulation (FM)
and phase modulation (PM) are basically nonlinear systems from the model-based
perspective. They are characterized by high bandwidth requirements and
their performance is outstanding in noisy environments. Both can be captured
by the transmitted measurement model:

$s(t) = \sqrt{2P} \sin[\omega_c t + k_p m(t)]$ (PM)

or

$s(t) = \sqrt{2P} \sin[\omega_c t + 2\pi k_f \int m(\tau)\, d\tau]$ (FM)

where $P$ is a constant, $\omega_c$ is the carrier frequency, $k_p$ and $k_f$ are the deviation
constants for the respective modulation systems and, of course, $m(t)$ is the
message model. Demodulation to extract the message from the transmission
is accomplished by estimating the phase of $s(t)$. For FM, the recovered phase
is differentiated and scaled to extract the message, while PM only requires the
scaling.

Suppose the message signal is given by the Gauss-Markov representation

$m(t) = -a m(t-1) + w(t-1)$
$y(t) = s(t) + v(t)$

with both $w$ and $v$ zero-mean, Gaussian with variances, $R_{ww}$ and $R_{vv}$.

(a) Construct a receiver for the PM system using the XBP design.

(b) Construct an equivalent receiver for the FM system.

(c) Assume that the message amplitude parameter $a$ is unknown; construct
the ABSP receiver for the PM system to jointly estimate the message and
parameter.

(d) Under the same assumptions as (c), construct the ABSP receiver for the
FM system to jointly estimate the message and parameter.

(e) Compare the receivers for both systems. What are their similarities and
differences?

8.10 We are given the population model of the Chapter 7 case study and would like
to “parameterize” it for adaptive processing, since we know the parameters
are not very well known. The state transition and corresponding measurement
model are given by

$x(t) = \frac{1}{2} x(t-1) + \frac{25 x(t-1)}{1 + x^2(t-1)} + 8\cos(1.2(t-1)) + w(t-1)$

$y(t) = \frac{x^2(t)}{20} + v(t)$

where $\Delta t = 1.0$, $w \sim \mathcal{N}(0, 10)$ and $v \sim \mathcal{N}(0, 1)$. The initial state is Gaussian
distributed with $x(0) \sim \mathcal{N}(0.1, 5)$.

In terms of the nonlinear state-space representation, we have

$a[x(t-1)] = \frac{1}{2} x(t-1) + \frac{25 x(t-1)}{1 + x^2(t-1)}$

$b[u(t-1)] = 8\cos(1.2(t-1))$

$c[x(t)] = \frac{x^2(t)}{20}$

(a) Choose the model constants: 25, 8, 0.5 and $\frac{1}{20}$ as the unknown parameters,
reformulate the state estimation problem as a parameter estimation
problem with unknown parameter vector, $\Theta$ and a random walk model
with corresponding process noise variance, $R_{ww} = \mathrm{diag}[1 \times 10^{-6}]$.

(b) Develop the joint SPBP algorithm to solve this problem. Run the SPBP 
algorithm and discuss the performance results. 

(c) Develop the joint PF algorithm to solve this problem. Run the PF 
algorithm and discuss the performance results. 

(d) Choose to “move” the particles using the roughening approach. How do
these results compare to the standard bootstrap algorithm?

(e) Develop the joint linearized (UKF) PF algorithm to solve this problem.
Run this PF algorithm and discuss the performance results.


DISCRETE HIDDEN MARKOV MODEL BAYESIAN PROCESSORS


9.1 INTRODUCTION 

In this chapter we develop discrete (event) hidden Markov models. All of the Bayesian 
processors we have discussed are, in fact, hidden Markov processors, since the internal 
states are usually not measured directly and are therefore “hidden” by definition, but 
the distinguishing factor is the type of underlying process governing the sequence. In 
fact, the (state) transition matrix is a “probability” matrix with specific properties that 
distinguish it uniquely from other dynamic systems. These discrete representations of 
stochastic processes find enormous application in the speech, economics, biomedical, 
communications and music areas where coding approaches are prevalent. We discuss 
the development of the basic processor and investigate a case study in communications 
to demonstrate the design and application. 


9.2 HIDDEN MARKOV MODELS 

A discrete-time hidden Markov model (HMM) is a stochastic representation (model)
of a process that can be used for simulation, modeling and estimation much the same
as the state-space model is used for dynamic (physical) systems. These models are
prevalent in acoustics, biosciences, climatology, control, communications, econometrics,
text recognition, image processing, signal processing and speech processing
[1]. Perhaps its distinguishing feature is that it is a “probabilistic model” in the sense
that it is driven by internal probability distributions for both states and observations
or equivalently measurements. Here the state transition matrix prevalent in linear
systems theory is still valid and also called a transition matrix, but it is a discrete
probability matrix with rows summing to unity and in some cases (doubly stochastic)
columns summing to unity as well. The underlying structure from which the HMM
evolves is the Markov chain of Chapter 3 along with the sequential Bayesian recursions
of Chapter 2. We start with the idea of a Markov chain and its decomposition
basics leading to the HMM.

9.2.1 Discrete-Time Markov Chains 

A discrete-time Markov chain is characterized by a state variable that changes at
certain time instances [2-4]. At each time-step $t$ the state is defined by $x(t) \in \mathcal{X}$
(state-space) with $\mathcal{X} = \{\mathcal{X}_1, \ldots, \mathcal{X}_{N_x}\}$. The probability that at time $t$ the chain occupies state
$i$ is defined by $\Pr(x_i(t))$. The dynamics of the Markov chain are represented by its
transition probability, $a_{mn}(t-1, t) := \Pr(x_n(t)|x_m(t-1))$ where $x_i(t) := \{x(t) = \mathcal{X}_i\}$.
This expression is the probability that the state at time $t$ is $\mathcal{X}_n$ given that
the chain was in state $\mathcal{X}_m$ at time $t-1$ for $(\mathcal{X}_m, \mathcal{X}_n) \in \mathcal{X}$. Here the key Markovian
assumption is that the transition probability $a_{mn}$ applies whenever state $\mathcal{X}_m$ is “visited”
independent of the “past” and the path or previous states employed to reach $\mathcal{X}_m$. This
is merely a statement of the Markovian property that

$a_{mn}(t-1, t) = \Pr(x_n(t)|x_m(t-1), \ldots, x_\ell(0)) = \Pr(x_n(t)|x_m(t-1)) \quad \forall t$ and $(\mathcal{X}_m, \mathcal{X}_n) \in \mathcal{X}$ (9.1)

Further, if the chain is homogeneous-in-time, then $a_{mn}(t-1, t)$ depends only on the
time difference (in general) and therefore the transition probability is stationary such
that $a_{mn}(t-1, t) \to a_{mn}$ with $a_{mn} \geq 0$ and $\sum_{n=1}^{N_x} a_{mn} = 1$ [2].

Summarizing, a discrete-time Markov chain is characterized by:

• a finite set of $N_x$ known states: $\mathcal{X} = \{\mathcal{X}_1, \ldots, \mathcal{X}_{N_x}\}$;

• a non-negative set of state-transition probabilities for $(\mathcal{X}_m, \mathcal{X}_n) \to \{a_{mn}\}$; and

• a sequence of random variables: $x_m(0), x_n(1), \ldots \in \mathcal{X}$ that satisfy the Markovian
property:

$a_{mn}(t-1, t) = \Pr(x_n(t)|x_m(t-1)) \quad \forall t$ and all states $(\mathcal{X}_m, \mathcal{X}_n) \in \mathcal{X}$

The elements of the homogeneous chain are embedded in the $N_x \times N_x$ state transition
probability matrix, $A = [a_{mn}]$; $m, n = 1, \ldots, N_x$. The chain can be specified
pictorially by a directed graph with nodes representing states and arcs or arrows
corresponding to the transition probabilities as illustrated in Fig. 9.1.

Example 9.1

Suppose we are given a two-state ($N_x = 2$) Markov chain with transition probability
$a_{mn} = \Pr(x_n(t)|x_m(t-1))$; $m, n = 1, 2$
and $a_{mn} = \{0.55, 0.45, 0.25, 0.75\}$. Construct the state-transition probability matrix $A$
and the corresponding directed graph. The transition probability matrix is given by:

$A = \begin{bmatrix} 0.55 & 0.45 \\ 0.25 & 0.75 \end{bmatrix}$

The resulting graph is shown in Fig. 9.1.

FIGURE 9.1 Directed graph representation of two-state Markov chain.
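The construction in Example 9.1 can be checked numerically. The following is a minimal Python sketch under stated assumptions: states are indexed from 0 internally, and the helper names (`is_stochastic`, `directed_graph`, `simulate_chain`) are hypothetical, introduced only for illustration.

```python
import random

# Transition matrix of Example 9.1; row m holds Pr(x_n(t) | x_m(t-1)).
A = [[0.55, 0.45],
     [0.25, 0.75]]

def is_stochastic(P, tol=1e-12):
    """True if every row is non-negative and sums to unity."""
    return all(min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol for row in P)

def directed_graph(P):
    """Arcs of the directed graph as {(from_state, to_state): probability}.

    States are reported with 1-based labels to match the text.
    """
    return {(m + 1, n + 1): p
            for m, row in enumerate(P)
            for n, p in enumerate(row) if p > 0.0}

def simulate_chain(P, x0, n_steps, seed=0):
    """Draw a state sequence (0-based state indices) from the homogeneous chain."""
    rng = random.Random(seed)
    states = [x0]
    for _ in range(n_steps):
        row = P[states[-1]]                      # transition row of the current state
        states.append(rng.choices(range(len(row)), weights=row)[0])
    return states

graph = directed_graph(A)
path = simulate_chain(A, x0=0, n_steps=20)
```

The dictionary returned by `directed_graph` lists one arc per non-zero transition probability, which is the information drawn in the directed graph of the example.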


△△△


9.2.2 Hidden Markov Chains 

A hidden Markov model (HMM) is simply a Markov chain in which not all of the states
are observed; some are hidden. In this case we introduce the observation or
measurement or output process where only a subset of the states are observed directly.
Thus, the essential difference between a Markov chain and a hidden Markov model
is that for a HMM there is not a one-to-one correspondence between the states and
observations (output measurements). It is not possible to tell which state the model
was in by merely observing the outputs of the chain. We illustrate the structural model
in Fig. 9.2a. Note that when the states are directly observed in (a), the observations
and states are identical.

Thus, we differentiate the hidden Markov chain or HMM from the Markov chain by
introducing an observation or measurement process [5-8]. Here the state sequence
is not known, that is, it is hidden in the measurement sequence. Thus, at every
time-step $t$, the system generates a measurement or observation $y(t)$ according to a
probability distribution that depends on the state, $x(t)$. The number of observations
$N_y$ corresponds to a known distinct set, that is, at time $t$ the observation is $y(t) \in \mathcal{Y}$
(observation space) with $\mathcal{Y} = \{\mathcal{Y}_1, \ldots, \mathcal{Y}_{N_y}\}$. We define the corresponding “discrete”
observation probability (likelihood) by¹:

$c_{k\ell}(t, t) := \Pr(y_\ell(t)|x_k(t))$ (9.2)

¹ This probability expression has two subscripts to annotate the discrete state ($x_k(t)$) and the discrete
measurement or observation ($y_\ell(t)$). Most references assume a continuous observation and use the notation
$c_k(y(t))$ [9-12].

FIGURE 9.2 Hidden Markov model structure: (a) Markov chain (states) and observations
(measurements), (b) Markov chain with state-transition probabilities and observations
with measurement probabilities (likelihoods).


Again for the homogeneous case, $c_{k\ell}(t, t) \to c_{k\ell}$ with $c_{k\ell} \geq 0$ and $\sum_{\ell=1}^{N_y} c_{k\ell} = 1$. As
in the chain, we have the associated $N_y \times N_x$ observation probability matrix given
by $C = [c_{k\ell}]$ for $k = 1, \ldots, N_x$; $\ell = 1, \ldots, N_y$ with $C \in \mathbb{R}^{N_y \times N_x}$. The final ingredient
to characterize the HMM is the prior or initial probability distribution given by
$\Pr(x_i(0))$; $i = 1, \ldots, N_x$ which represents the initial probability of the chain.

Summarizing, a hidden Markov model is specified by the model $\Sigma :=
\{A, C, \Pr(x_i(0))\}$ (homogeneous case) where:

• The state-transition probability matrix is:
$A = [a_{mn}] = \Pr(x_n(t)|x_m(t-1))$; $m, n = 1, \ldots, N_x$;

• The observation probability matrix is:
$C = [c_{k\ell}] = \Pr(y_\ell(t)|x_k(t))$ for $k = 1, \ldots, N_x$; $\ell = 1, \ldots, N_y$; and

• The prior probability is:
$\Pr(x_i(0))$; $i = 1, \ldots, N_x$,

where $N_x$, $N_y$ are the number of states and observations (measurements), respectively
(see Fig. 9.2b).
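The three ingredients of the model specification can be collected in a small container that checks the probability constraints stated above. This is only a sketch: the class name `HMM`, the storage layout (one row of $C$ per state, so that each row is a distribution over the observations) and the numerical values of $C$ and the prior are assumptions made for illustration; only the matrix $A$ is taken from Example 9.1.

```python
from dataclasses import dataclass

def _check_rows(name, rows, width, tol=1e-9):
    """Raise ValueError unless every row has the given width, is non-negative and sums to one."""
    for r, row in enumerate(rows):
        if len(row) != width:
            raise ValueError(f"{name}: row {r} has length {len(row)}, expected {width}")
        if any(p < 0.0 for p in row) or abs(sum(row) - 1.0) > tol:
            raise ValueError(f"{name}: row {r} is not a probability distribution")

@dataclass(frozen=True)
class HMM:
    """Homogeneous HMM specified by transition matrix A, observation matrix C and prior."""
    A: tuple      # A[m][n] = Pr(x_n(t) | x_m(t-1))
    C: tuple      # C[k][l] = Pr(y_l(t) | x_k(t)), one row per state (assumed layout)
    prior: tuple  # prior[i] = Pr(x_i(0))

    def __post_init__(self):
        n_x = len(self.prior)
        if len(self.A) != n_x or len(self.C) != n_x:
            raise ValueError("A and C need one row per state")
        _check_rows("A", self.A, n_x)
        _check_rows("C", self.C, len(self.C[0]))
        _check_rows("prior", (self.prior,), n_x)

    @property
    def n_states(self):
        return len(self.prior)

    @property
    def n_observations(self):
        return len(self.C[0])

# A from Example 9.1; C and the prior are hypothetical values.
model = HMM(A=((0.55, 0.45), (0.25, 0.75)),
            C=((0.7, 0.2, 0.1), (0.1, 0.3, 0.6)),
            prior=(0.5, 0.5))
```

Validating the rows at construction time catches a mis-specified model before any recursion is run on it.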

We must realize a subtle point that in contrast to dynamic physical systems where
the states and measurements can be any real value or number, the HMM states or
observations can only assume pre-defined integer values, $\mathcal{X} = \{\mathcal{X}_1, \ldots, \mathcal{X}_{N_x}\}$ and
$\mathcal{Y} = \{\mathcal{Y}_1, \ldots, \mathcal{Y}_{N_y}\}$; this is very important to comprehend. It is the transition and
observation probabilities that drive the occurrence of an individual state and observation
event, since both are merely (integer) realizations of model outputs. For example,
the mapping or quantization of a “real” physical communications signal to a binary
coded representation takes the form of a sequence of a 0 or 1 integer value at each
time-step which are mapped to the observation sequence (see Sec. 9.7).

It should also be noted that with the addition of the observation process, it is simple
to define an underlying HMM state-space model as:

$x(t) = A x(t-1) + w(t-1)$

$y(t) = C x(t) + v(t)$ (9.3)

where $w$, $v$ are uncorrelated (white) sequences (noise) with $x$, $y$ the usual state and
measurement sequences and associated initial conditions $x(0)$, all specified by the
HMM above of Fig. 9.2b.

Note that the additive sequences (noise) are not necessarily Gaussian and therefore
the linear Bayesian processor (Kalman filter) is not optimum in this case. However,
it has been shown [13] that under some (sufficient) conditions (stationarity, etc.)
an optimal minimum error variance estimator (Kalman filter) can be constructed
based on a stochastic realization of a HMM. The results of this design are capable
of providing reasonable estimates of the HMM states and observations. It is also
important to understand that there exist state-space representations in which discrete
HMMs are combined with dynamic (physical) state-space systems. For
instance, one prevalent form is termed switching models in which the discrete HMM
determines which underlying dynamic model applies at a given time-step. This is an
approach frequently used in target tracking problems [14] to provide multiple model
choices.


9.3 PROPERTIES OF THE HIDDEN MARKOV MODEL 

In this section we investigate some of the underlying probabilistic properties of the 
discrete HMM. To no surprise it matches our Bayesian processor development of 
the previous chapters. After all, once placed in the state-space representation, all 
Bayesian properties should hold. We start with the joint distribution. 

The HMM is a probabilistic model of the joint collection of random variates $(Y_t, X_t)$.
Critical properties of the HMM rely on basic Bayesian methodologies and developments.
Perhaps the most useful notions inherited from the Markov chain theory are the
two major properties of conditional independence that are used over and over again
along with Bayes’ rule. That is, under a HMM there are two assumptions enabling the
development of the underlying techniques:

1. The hidden variables are first-order Markov²: $\Pr(x(t)|X_{t-1}, Y_{t-1}) =
\Pr(x(t)|x(t-1))$ (state-transition); and

2. The observation is independent of other variates given the state (at time $t$):
$\Pr(y(t)|Y_{t-1}, X_t) = \Pr(y(t)|x(t))$ (likelihood).

These properties imply that the underlying joint distribution can be expanded as:

$\Pr(Y_t, X_t) = \Pr(y(t), Y_{t-1}, x(t), X_{t-1})$

$= \Pr(y(t), x(t)|Y_{t-1}, X_{t-1}) \times \Pr(Y_{t-1}, X_{t-1})$ (9.4)

Continuing to apply Bayes’ rule to this expression gives

$\Pr(Y_t, X_t) = [\Pr(y(t)|x(t), X_{t-1}, Y_{t-1}) \times \Pr(x(t)|X_{t-1}, Y_{t-1})] \times \Pr(Y_{t-1}, X_{t-1})$ (9.5)

or finally

$\Pr(Y_t, X_t) = \Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1)) \times \Pr(Y_{t-1}, X_{t-1})$ (9.6)

where we have applied the conditional independence properties of the HMM. From
the chain rule and these independence properties, we can expand this expression even
further to obtain

$\Pr(Y_t, X_t) = \prod_{k=0}^{t} \Pr(y(k)|x(k)) \times \prod_{k=1}^{t} \Pr(x(k)|x(k-1)) \times \Pr(x(0))$ (9.7)
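The factorization of Eq. 9.7 can be evaluated directly for one particular state path and observation sequence. The sketch below assumes 0-based integer indices for states and observations and a row-per-state layout for $C$; the function name and the uniform prior are hypothetical, and the numerical values of $A$ and $C$ are illustrative only.

```python
def joint_probability(states, observations, A, C, prior):
    """Pr(Y_t, X_t) for one state path and one observation sequence.

    Multiplies the prior of the first state, one transition probability per
    step and one likelihood per time-step.
    """
    if len(states) != len(observations) or not states:
        raise ValueError("need equally long, non-empty state and observation sequences")
    p = prior[states[0]] * C[states[0]][observations[0]]       # Pr(x(0)) Pr(y(0)|x(0))
    for k in range(1, len(states)):
        p *= A[states[k - 1]][states[k]]                        # Pr(x(k)|x(k-1))
        p *= C[states[k]][observations[k]]                      # Pr(y(k)|x(k))
    return p

# Illustrative two-state, three-observation model with an assumed uniform prior.
A = [[0.6, 0.4], [0.3, 0.7]]
C = [[0.50, 0.25, 0.25], [0.35, 0.25, 0.40]]
prior = [0.5, 0.5]

p = joint_probability(states=[0, 0, 1], observations=[0, 1, 2], A=A, C=C, prior=prior)
```

For this path the product is $(0.5 \times 0.5) \times (0.6 \times 0.25) \times (0.4 \times 0.40)$, one factor pair per time-step.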


Thus, in order to characterize a HMM we require the following probabilities:

• Prior: $\Pr(x(0))$;

• Transition: $\Pr(x(t)|x(t-1))$; and

• Likelihood: $\Pr(y(t)|x(t))$.

These quantities correspond to the classic³ definition of a hidden Markov
model [10]. The underlying Markov chain is usually assumed to be homogeneous-in-time
with associated stochastic state transition matrix defined before by
$A = [a_{mn}] = \Pr(x_n(t)|x_m(t-1))\ \forall t$ and the observation (measurement) probability is
given by $C = [c_{k\ell}] = \Pr(y_\ell(t)|x_k(t))$.

² Any higher-order Markov process can be transformed to a first-order process [15].

³ Classical notation: $\pi_0 \to \Pr(x(0))$; $a_{ij} \to \Pr(x_j(t)|x_i(t-1))$; and $b_i(y(t)) \to \Pr(y(t)|x_i(t))$ (continuous
observation) or $b_{ij} \to \Pr(y_j(t)|x_i(t))$ (discrete observation).

FIGURE 9.3 HMM basic problems: (a) Evaluation problem, (b) Sequence estimation
problem, (c) Parameter estimation problem.

The HMM parameters are usually specified
by $\Sigma = \{A, C, \Pr(x_i(0))\}$. Here the measurements $Y_t$ are observed and the states
or internal variables are hidden. In order to generate (simulate) samples from
the HMM, the initial state is drawn from the prior distribution, each subsequent state
is drawn from the transition distribution, and each measurement is drawn from the
likelihood conditioned on the current state (prior → transition → likelihood). It is
important to understand that each measurement sample simulated requires a new
state sample, that is, two synthesized measurement samples originate from two
different state samples in the hidden Markov chain.
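The sampling order prior → transition → likelihood can be sketched in a few lines of Python. The function name `simulate_hmm`, the 0-based integer coding of states and observations, the row-per-state layout of $C$ and the uniform prior are assumptions made for this illustration.

```python
import random

def simulate_hmm(A, C, prior, n_steps, seed=0):
    """Draw a hidden state sequence and an observation sequence from an HMM.

    Each observation is generated from a freshly drawn state sample.
    """
    rng = random.Random(seed)

    def draw(distribution):
        # Sample an index according to the given discrete distribution.
        return rng.choices(range(len(distribution)), weights=distribution)[0]

    x = draw(prior)                       # prior: initial state
    states = [x]
    observations = [draw(C[x])]           # likelihood: observation given the state
    for _ in range(n_steps):
        x = draw(A[x])                    # transition: next state given the current one
        states.append(x)
        observations.append(draw(C[x]))   # likelihood for the new state
    return states, observations

# Illustrative model values with an assumed uniform prior.
A = [[0.6, 0.4], [0.3, 0.7]]
C = [[0.50, 0.25, 0.25], [0.35, 0.25, 0.40]]
prior = [0.5, 0.5]

states, observations = simulate_hmm(A, C, prior, n_steps=100)
```

Only `observations` would be available to a processor; `states` is kept here so that estimates can later be compared with the simulated truth.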


9.4 HMM OBSERVATION PROBABILITY: EVALUATION PROBLEM 

With these HMM properties available, we can now pose the first problem of interest.
With the model known and a set of observations available, how can we evaluate the
performance of the HMM to faithfully synthesize observations? One way to approach
this problem is to estimate the corresponding observation probability $\Pr(Y_T)$ and use
it to “validate” that the model and observations are compatible (see Fig. 9.3a). This
approach is especially useful when we are to compare or “match” different models
to the same observation sequence and search for that model which provides the best
match. Thus, calculating the total observation probability provides a solution to the
evaluation problem of HMM, that is,

GIVEN the observation sequence, $Y_T$ and HMM parameters $\Sigma$, FIND the total
observation probability, $\Pr(Y_T)$ for $Y_T = \{y(0), \ldots, y(T)\}$.

The observation probability is obtained by marginalizing (summing over) the total
probability

$\Pr(Y_t) = \sum_{X_t} \Pr(Y_t, X_t) = \sum_{X_t} \left( \prod_{k=0}^{t} \Pr(y(k)|x(k)) \times \prod_{k=1}^{t} \Pr(x(k)|x(k-1)) \times \Pr(x(0)) \right)$ (9.8)
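To make the cost of the direct sum in Eq. 9.8 concrete, the following Python sketch enumerates every possible state path and accumulates the joint probability of each. It is a brute-force illustration only; the function name, the 0-based indexing, the row-per-state layout of $C$ and the uniform prior are assumptions.

```python
from itertools import product

def observation_probability_bruteforce(observations, A, C, prior):
    """Total observation probability by summing the joint probability over all state paths.

    Returns the probability and the number of paths that were enumerated.
    """
    n_states = len(prior)
    total = 0.0
    n_paths = 0
    for path in product(range(n_states), repeat=len(observations)):
        p = prior[path[0]] * C[path[0]][observations[0]]
        for k in range(1, len(path)):
            p *= A[path[k - 1]][path[k]] * C[path[k]][observations[k]]
        total += p
        n_paths += 1
    return total, n_paths

# Illustrative model values with an assumed uniform prior.
A = [[0.6, 0.4], [0.3, 0.7]]
C = [[0.50, 0.25, 0.25], [0.35, 0.25, 0.40]]
prior = [0.5, 0.5]

prob, n_paths = observation_probability_bruteforce([0, 1, 2], A, C, prior)
```

The number of enumerated paths is $N_x$ raised to the number of time-steps, so the work grows exponentially with the length of the observation sequence.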

Blindly computing this summation is a very inefficient method to estimate the
desired probability, since the number of state sequences to be summed grows
exponentially with the number of time-steps. Instead, we factor the states using their
first-order Markov property (conditional independence) such that

$\Pr(Y_t) = \sum_{x(t), x(t-1)} \Pr(Y_t, x(t), x(t-1))$ (9.9)

but continuing the expansion over $Y_t$ we have

$\Pr(Y_t, x(t), x(t-1)) = \Pr(Y_{t-1}, x(t-1), y(t), x(t))$

$= \Pr(y(t), x(t)|Y_{t-1}, x(t-1)) \times \Pr(x(t-1), Y_{t-1})$

or

$\Pr(Y_t, x(t), x(t-1)) = \Pr(y(t)|x(t), Y_{t-1}, x(t-1)) \times \Pr(x(t)|Y_{t-1}, x(t-1)) \times \Pr(x(t-1), Y_{t-1})$

which yields the final expression

$\Pr(Y_t, x(t), x(t-1)) = \Pr(y(t)|x(t)) \times \Pr(x(t)|x(t-1)) \times \Pr(Y_{t-1}, x(t-1))$ (9.10)

Marginalizing, we obtain

$$\Pr(Y_t, x(t)) = \sum_{x(t-1)} \Pr(Y_t, x(t), x(t-1)) = \sum_{x(t-1)} \Pr(y(t) \mid x(t)) \times \Pr(x(t) \mid x(t-1)) \times \Pr(Y_{t-1}, x(t-1)) \tag{9.11}$$

Define the forward operator as $\mathcal{F}_k(t) := \Pr(Y_t, x(t) = X_k)$, then rewriting Eq. 9.11
we have

$$\mathcal{F}_k(t) = \Pr(Y_t, x_k(t)) = \Pr(y(t) \mid x_k(t)) \sum_{\ell} \Pr(x_k(t) \mid x_\ell(t-1)) \times \Pr(Y_{t-1}, x_\ell(t-1))$$

or substituting for the previous time-step, we obtain the forward recursion for the
HMM

$$\mathcal{F}_k(t) = \Pr(y(t) \mid x_k(t)) \sum_{\ell} \Pr(x_k(t) \mid x_\ell(t-1)) \times \mathcal{F}_\ell(t-1) \tag{9.12}$$

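Now, if we assume a stationary chain, then $A = [a_{\ell k}]$, $C = [c_{k\ell}]$, and this result can
be expressed in terms of transition probabilities simply as

$$\mathcal{F}_k(t) = \sum_{\ell} a_{\ell k} \times c_{k\ell} \times \mathcal{F}_\ell(t-1) \quad \text{for } c_{k\ell} = \Pr(y_\ell(t) \mid x_k(t)) \tag{9.13}$$

Here the second subscript of $c_{k\ell}$ labels the symbol observed at time $t$; it is fixed by
$y(t)$ and is not summed over.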


Clearly marginalizing over the final state $x(T)$ gives the total observation probability as

$$\Pr(Y_T) = \sum_{k} \mathcal{F}_k(T) = \sum_{x_k(T)} \Pr(Y_T, x_k(T)) \tag{9.14}$$

Thus, we have the forward recursion algorithm for HMM that can be used to obtain
the total observation probability as:

• Initialize: $\mathcal{F}_k(0) = \Pr(x_k(0)) \times c_{k0}$;

• Recursion: $\mathcal{F}_k(t) = \sum_{\ell} a_{\ell k} \times c_{k\ell} \times \mathcal{F}_\ell(t-1)$;

• Termination: $\Pr(Y_T) = \sum_{k} \mathcal{F}_k(T)$.

This algorithm will be used not only to estimate the observation probability as 
the solution to the evaluation problem, but also to combine with another recursion to 
estimate model states and parameters. 
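
The forward recursion can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used for the results in this chapter: the matrices are those of the two-state example that follows, while the uniform prior, the short observation list and the function names (`forward`, `brute_force`) are assumptions made for the sketch. The brute-force sum over all paths (Eq. 9.8) is included only to check that both routes give the same probability.

```python
from itertools import product


def forward(A, C, prior, obs):
    """Return (F, total): F[t][k] = Pr(Y_t, x_k(t)), total = Pr(Y_T).

    A[l][k] = Pr(x_k(t) | x_l(t-1)); C[k][o] = Pr(symbol o | state k).
    """
    n_states = len(prior)
    # Initialize: F_k(0) = Pr(x_k(0)) * Pr(y(0) | x_k(0))
    F = [[prior[k] * C[k][obs[0]] for k in range(n_states)]]
    # Recursion: F_k(t) = Pr(y(t) | x_k(t)) * sum_l a_lk * F_l(t-1)
    for y in obs[1:]:
        prev = F[-1]
        F.append([
            C[k][y] * sum(A[l][k] * prev[l] for l in range(n_states))
            for k in range(n_states)
        ])
    # Termination: Pr(Y_T) = sum_k F_k(T)
    return F, sum(F[-1])


def brute_force(A, C, prior, obs):
    """Sum Pr(Y_T, X_T) over every state path (Eq. 9.8); exponential cost."""
    total = 0.0
    for path in product(range(len(prior)), repeat=len(obs)):
        p = prior[path[0]] * C[path[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[path[t - 1]][path[t]] * C[path[t]][obs[t]]
        total += p
    return total


A = [[0.6, 0.4], [0.3, 0.7]]
C = [[0.50, 0.25, 0.25], [0.35, 0.25, 0.40]]
prior = [0.5, 0.5]            # assumed uniform initial state probability
obs = [0, 2, 1, 2, 2, 0]      # assumed observations, symbols coded 0, 1, 2

F, p_forward = forward(A, C, prior, obs)
p_brute = brute_force(A, C, prior, obs)
```

The recursion costs on the order of $N_x^2 T$ operations, whereas the enumeration visits $N_x^{T+1}$ paths. For long sequences the raw products underflow, so a practical version would work with scaled operators or logarithms.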

Before we close this section, consider the following example of simulating a HMM. 

Example 9.2 

Suppose we have a discrete binary signal with two states ($N_x = 2$) specified
by: $\{x_1(t) = 1, x_2(t) = 2\}$. The observation is also discrete with $N_y = 3$ and
specified by $\{y_1(t) = 1, y_2(t) = 2, y_3(t) = 3\}$. The transition and observation probabilities
are given by:

$a_{mn} = \Pr(x_n(t) \mid x_m(t-1));\ m, n = 1, 2$ and $c_{kn} = \Pr(y_k(t) \mid x_n(t));\ k = 1, 2, 3$

with $a_{mn} = \{0.6, 0.4; 0.3, 0.7\}$ and $c_{kn} = \{0.50, 0.25, 0.25; 0.35, 0.25, 0.40\}$. Construct
the state transition probability matrix $A$ and the corresponding directed graph.

FIGURE 9.4 (panels: state sequence; observation sequence) (a) Hidden state simulation,
(b) Observation simulation. Note that both states/observations can assume only integer
values governed by the transition and observation probabilities.


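The transition and observation probability matrices are given by:

$$A = \begin{bmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{bmatrix}, \qquad C = \begin{bmatrix} 0.50 & 0.25 & 0.25 \\ 0.35 & 0.25 & 0.40 \end{bmatrix}$$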


The resulting directed graph is identical to that in Fig. 9.1. The transition probability
implies that once a particular state is occupied, the chain is more likely to remain in that
state ($a_{11} = 0.6$ and $a_{22} = 0.7$) than to transition to the other state ($a_{12} = 0.4$
and $a_{21} = 0.3$). So we expect the state sequence to dwell in each state for several
time-steps with relatively few transitions. The observation probabilities, on the other hand,
are spread more evenly, with $c_{11} = 0.5$ implying that when occupying state 1 the
most probable observation output is 1, and $c_{23} = 0.4$ implying that when in
state 2 the output 3 is most likely.

Using MATLAB a simulation was performed for 100 samples with the results
shown in Fig. 9.4. The state transitions are shown in (a) and appear to conform to the
intuition afforded by the transition probability, while the observations in (b) follow
the observation probabilities as well. The evaluation problem can be solved by estimating
the corresponding total observation probability, which is $\ln \Pr(Y_T) = -106$, and then
making additional runs with various HMM for comparison. △△△
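
A simulation of this kind can be sketched as follows. This is a minimal Python illustration rather than the MATLAB run reported above: the function name `simulate_hmm`, the seed and the uniform initial state probability are assumptions, and states and symbols are coded from 0 instead of 1.

```python
import random


def simulate_hmm(A, C, prior, n_samples, seed=0):
    """Draw a hidden state sequence and its observations from a discrete HMM.

    A[m][n] = Pr(next state n | current state m);
    C[n][k] = Pr(symbol k | state n). States and symbols are coded from 0.
    """
    rng = random.Random(seed)

    def draw(probs):
        # Sample an index according to the given probability row.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    states, observations = [], []
    x = draw(prior)                      # initial state
    for _ in range(n_samples):
        states.append(x)
        observations.append(draw(C[x]))  # emit a symbol from the current state
        x = draw(A[x])                   # move to the next state
    return states, observations


A = [[0.6, 0.4], [0.3, 0.7]]
C = [[0.50, 0.25, 0.25], [0.35, 0.25, 0.40]]
prior = [0.5, 0.5]   # assumed; the example does not state the initial probability

states, observations = simulate_hmm(A, C, prior, n_samples=100)
n_transitions = sum(1 for a, b in zip(states, states[1:]) if a != b)
```

Because both diagonal entries of $A$ exceed 0.5, the count `n_transitions` should typically be well below the number of samples, in line with the dwell-time intuition discussed in the example.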

9.5 STATE ESTIMATION IN HMM: THE VITERBI TECHNIQUE

In this section we develop solutions to the state estimation problem for two cases of 
interest: (1) individual hidden state estimation, that is, the state estimate at a given 
time-step; and (2) entire sequence or “all” time-steps hidden state estimation problem. 
Both of these problems lead to reconstructing the entire sequence of hidden states 
strictly from the observations and HMM. Think of receiving a signal and being asked 
to retrieve or recognize the individual symbols from a coded sequence. Here the 
hidden states are the embedded sequence and the observations are the noisy digitized 
measurements. 

9.5.1 Individual Hidden State Estimation 

State estimation in HMM provides a methodology by which the hidden variables or
states embedded within the hidden Markov model can be extracted from knowledge
of the model parameters $\Sigma$ and the noisy observations as illustrated in Fig. 9.3b. The
basic problem to be solved is:

GIVEN the observation sequence, $Y_T$, and HMM parameters $\Sigma$, FIND the “best”
(MAP) estimate $\hat{x}_k(t)$ of the hidden state at time $t$ based on the posterior distribution
$\Pr(x_k(t) \mid Y_T)$, that is,

$$\hat{x}_k(t) = \arg\max_{x_k(t)} \Pr(x_k(t) \mid Y_T) \quad \text{for } x_k(t) \rightarrow x(t) = X_k$$

The solution to this estimation problem is analogous to the forward recursion algorithm
and incorporates the so-called backward algorithm, since it proceeds sequentially
backwards in time (smoothing). The solution which follows is accomplished by
decomposing or partitioning the total observation sequence into two sub-sequences:
$Y_t = \{y(0), \ldots, y(t)\}$ and $Y_{t+1:T} := \{y(t+1), \ldots, y(T)\}$. To see this let us investigate
the solution to the state estimation problem assuming uninformative priors

$$\Pr(x_k(t) \mid Y_T) = \frac{\Pr(Y_T, x_k(t))}{\Pr(Y_T)} \propto \Pr(Y_T, x_k(t)) \tag{9.15}$$

Partitioning $Y_T$ as above we have $Y_T = \{Y_t, Y_{t+1:T}\}$, which can be used to
decompose the joint distribution as

$$\Pr(Y_T, x_k(t)) = \Pr(Y_t, Y_{t+1:T}, x_k(t)) = \Pr(Y_t, x_k(t)) \times \Pr(Y_{t+1:T} \mid Y_t, x_k(t))$$

Applying the Markov property along with the conditional independence properties of
the HMM, we have

$$\Pr(Y_T, x_k(t)) = \Pr(Y_t, x_k(t)) \times \Pr(Y_{t+1:T} \mid x_k(t)) \tag{9.16}$$

Defining the backward operator as $\mathcal{B}_k(t) := \Pr(Y_{t+1:T} \mid x_k(t))$, then we can write the
marginalization

$$\Pr(Y_{t+1:T} \mid x_k(t)) = \sum_{x_\ell(t+1)} \Pr(x_\ell(t+1), y(t+1), Y_{t+2:T} \mid x_k(t)) \tag{9.17}$$

and applying the Bayes’ rule

$$\Pr(Y_{t+1:T} \mid x_k(t)) = \sum_{x_\ell(t+1)} \Pr(Y_{t+2:T} \mid x_\ell(t+1), y(t+1), x_k(t)) \times \Pr(y(t+1), x_\ell(t+1) \mid x_k(t)) \tag{9.18}$$

Now using the Markovian independence properties of the HMM and expanding the
last term using the Bayes’ rule, we obtain

$$\Pr(Y_{t+1:T} \mid x_k(t)) = \sum_{x_\ell(t+1)} \Pr(Y_{t+2:T} \mid x_\ell(t+1)) \times \Pr(y(t+1) \mid x_\ell(t+1)) \times \Pr(x_\ell(t+1) \mid x_k(t)) \tag{9.19}$$

Using the definition of the backward operator, we obtain the final backward
recursion as:

$$\mathcal{B}_k(t) = \sum_{\ell} a_{k\ell} \Pr(y(t+1) \mid x_\ell(t+1))\, \mathcal{B}_\ell(t+1) \quad \text{for } t = T-1, T-2, \ldots, 1, 0 \tag{9.20}$$

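with $\mathcal{B}_k(T) = 1\ \forall k$, since the recursion starts at the final time-step. This relation,
when coupled with the forward operator, can also be used to calculate the desired
posterior probability for state estimation, since

$$\Pr(Y_T, x_k(t)) = \Pr(Y_t, x_k(t)) \times \mathcal{B}_k(t) = \mathcal{F}_k(t) \times \mathcal{B}_k(t) \tag{9.21}$$

Thus, we have by marginalization that the total observation probability can be
estimated (at any time-step $t$) by

$$\Pr(Y_T) = \sum_{k} \mathcal{F}_k(t) \times \mathcal{B}_k(t) \tag{9.22}$$

and therefore the posterior distribution is given by:

$$\Pr(x_k(t) \mid Y_T) = \frac{\mathcal{F}_k(t) \times \mathcal{B}_k(t)}{\sum_{k} \mathcal{F}_k(t) \times \mathcal{B}_k(t)} \tag{9.23}$$

leading to the desired state estimate, $\hat{x}_{\mathrm{MAP}}(t)$.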

With this information available, we now have the solution to the individual hidden
state estimation problem using the backward recursion algorithm as:

• Initialize: $\mathcal{B}_k(T) = 1\ \forall k$;

• Recursion: $\mathcal{B}_k(t) = \sum_{\ell} a_{k\ell} \Pr(y(t+1) \mid x_\ell(t+1))\, \mathcal{B}_\ell(t+1)$;

• Termination: $\Pr(Y_T) = \sum_{k} \mathcal{F}_k(t) \times \mathcal{B}_k(t)$;

• Posterior: $\Pr(x_k(t) \mid Y_T) = \mathcal{F}_k(t) \times \mathcal{B}_k(t) / \Pr(Y_T)$; and

• Estimation: $\hat{x}_k(t) = \arg\max_{x_k(t)} \Pr(x_k(t) \mid Y_T)$.

Together the forward-backward recursions are also the key ingredient to estimating the
HMM parameters from noisy observation data, which will be discussed subsequently,
but first we consider extending the MAP state estimation for the individual
state ($x_k(t)$) to the problem of estimating the entire sequence of hidden states ($X_T$)
from the observation data ($Y_T$).
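
The forward-backward steps listed above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the uniform prior, the short observation list and the function names are hypothetical, symbols and states are coded from 0, and no scaling is applied, so it is only suitable for short sequences.

```python
def forward(A, C, prior, obs):
    """F[t][k] = Pr(Y_t, x_k(t)) (forward operator)."""
    n = len(prior)
    F = [[prior[k] * C[k][obs[0]] for k in range(n)]]
    for y in obs[1:]:
        prev = F[-1]
        F.append([C[k][y] * sum(A[l][k] * prev[l] for l in range(n))
                  for k in range(n)])
    return F


def backward(A, C, obs):
    """B[t][k] = Pr(Y_{t+1:T} | x_k(t)) (backward operator)."""
    n, n_steps = len(A), len(obs)
    B = [[1.0] * n for _ in range(n_steps)]   # B_k(T) = 1 at the final step
    for t in range(n_steps - 2, -1, -1):      # proceed backwards in time
        B[t] = [sum(A[k][l] * C[l][obs[t + 1]] * B[t + 1][l] for l in range(n))
                for k in range(n)]
    return B


def state_posteriors(F, B):
    """Pr(x_k(t) | Y_T) = F_k(t) B_k(t) / sum_k F_k(t) B_k(t)."""
    post = []
    for f_t, b_t in zip(F, B):
        joint = [f * b for f, b in zip(f_t, b_t)]
        total = sum(joint)
        post.append([j / total for j in joint])
    return post


A = [[0.6, 0.4], [0.3, 0.7]]
C = [[0.50, 0.25, 0.25], [0.35, 0.25, 0.40]]
prior = [0.5, 0.5]                 # assumed uniform initial state probability
obs = [0, 0, 2, 2, 1, 2, 0, 0]     # assumed observations, symbols coded 0, 1, 2

F = forward(A, C, prior, obs)
B = backward(A, C, obs)
posterior = state_posteriors(F, B)
# Individual MAP estimate at each time-step (states coded 0 and 1).
map_states = [max(range(len(p)), key=p.__getitem__) for p in posterior]
```

A useful sanity check is that $\sum_k \mathcal{F}_k(t) \times \mathcal{B}_k(t)$ returns the same value, $\Pr(Y_T)$, at every time-step.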

An example follows to demonstrate the forward-backward recursion in estimating 
the posterior distribution. 

Example 9.3 

Suppose we have the discrete binary signal ($N_x = 2$) specified by: $\{x_1(t) = 1, x_2(t) = 2\}$
and the discrete observation specified by $\{y_1(t) = 1, y_2(t) = 2, y_3(t) = 3\}$. Here we
attempt to “decode” the message from the observations, that is, consider the binary
state sequence generated by the HMM along with the corresponding observations;
we wish to extract the coded state sequence at each time-step using the
forward-backward approach discussed above.

We apply MATLAB (hmmdecode) to perform the estimation using the forward-backward
approach for the 100 observation samples. Using the synthesized output
data and corresponding transition and observation probability matrices, we can
estimate the corresponding posterior distribution for each individual state, $\Pr(x_k(t) \mid Y_T)$,
with the results shown in Fig. 9.5. The combined state transitions are shown in (a) and
the posterior state probabilities in (b) and (c), respectively. We can see that the estimated
probabilities “match” the states reasonably well with the state transitioning
according to the estimated posterior at each time-step, that is, when the HMM is in state 1,
the posterior probability is high relative to that of state 2 and vice versa. This
completes the decoding example. Next we consider estimating the entire state sequence.

△△△


9.5.2 Entire Hidden State Sequence Estimation 

The maximum a posteriori estimation of the state at time t using the forward- 
backward recursion algorithm above can be extended to reconstruct the entire hidden 
state sequence which provides a more meaningful solution when attempting to extract 
a critical coded message from a hostile environment or accurately extracting a DNA 
sequence for forensic analysis. Unfortunately, estimating the individually “most 
likely” states as in the previous subsection does not imply that the entire sequence 
is estimated with minimal probability of error. Therefore, the state estimation prob¬ 
lem must be based on jointly estimating “all” states in the sequence to obtain the 
optimal solution. The individual state estimates at each time-step of the forward- 
backward algorithm minimize the error probabilities of individual states maximizing 
the expected number of correctly estimated states [16]. However, we are interested 
in estimating the entire state sequence, that is, we would like to solve the following 
problem: 

GIVEN the observation sequence, $Y_T$, and HMM parameters $\Sigma$, FIND the “best”
(MAP) estimate of the sequence $X_T$, where $X_T = \{x_k(0), \ldots, x_k(T)\}$, of the entire hidden
state sequence from time-step 0 to time-step $T$ based on the posterior distribution
$\Pr(X_T \mid Y_T)$, that is,

$$\hat{X}_T = \arg\max_{X_T} \Pr(X_T \mid Y_T)$$

FIGURE 9.5 HMM realization of a discrete two-state Markov chain and observation:
(a) Hidden state sequence realization, (b) State 1 posterior probability estimate,
$\Pr(x_1(t) \mid Y_T)$. (c) State 2 posterior probability estimate, $\Pr(x_2(t) \mid Y_T)$.

Since we are seeking a sequential solution to this problem, we must track the
estimate at time-step $t$. Following the development in [16], for each $x(t)$ a partial
sequence of length $t + 1$ is defined for each possible state; therefore, there are $N_x$
partial sequences for each $t$. An alternative is to use the joint distribution, since the
observation sequence $Y_T$ is fixed in length, that is, we require

$$\hat{X}_{t-1} = \arg\max_{X_{t-1}} \Pr(X_{t-1}, x(t), Y_t) \quad \text{for } x(t) \text{ endpoint} \tag{9.24}$$

At each time-step, the maximal path ($\hat{X}_{t-1}$) problem terminating in $x(t)$ given $Y_t$ is
transformed into the maximization problem of finding the best path ending in $x(t+1)$
given $Y_{t+1}$. This follows directly by applying the chain rule to the joint probability
distribution

$$\Pr(X_{t+1}, Y_{t+1}) = \Pr(x(t+1), X_t, y(t+1), Y_t) = \Pr(x(t+1), y(t+1) \mid X_t, Y_t) \times \Pr(X_t, Y_t) \tag{9.25}$$

applying Bayes’ rule along with the conditional independence properties of the chain
gives

$$\Pr(X_{t+1}, Y_{t+1}) = \Pr(y(t+1) \mid x(t+1)) \times \Pr(x(t+1) \mid x(t)) \times \Pr(X_t, Y_t) \tag{9.26}$$

Recursively maximizing the probability gives

$$\max_{X_t} \Pr(X_{t+1}, Y_{t+1}) = \max_{X_t} \{\Pr(y(t+1) \mid x(t+1)) \times \Pr(x(t+1) \mid x(t)) \times \Pr(X_t, Y_t)\}$$
$$= \Pr(y(t+1) \mid x(t+1)) \times \max_{X_t} \{\Pr(x(t+1) \mid x(t)) \times \Pr(X_t, Y_t)\}$$
$$= \Pr(y(t+1) \mid x(t+1)) \times \max_{x(t)} \Big\{\Pr(x(t+1) \mid x(t)) \times \max_{X_{t-1}} \{\Pr(X_t, Y_t)\}\Big\}$$

Now this gives us a recursion with $V(x(t)) := \max_{X_{t-1}} \{\Pr(X_t, Y_t)\}$

$$V(x(t+1)) = \Pr(y(t+1) \mid x(t+1)) \times \max_{x(t)} \{\Pr(x(t+1) \mid x(t)) \times V(x(t))\} \tag{9.27}$$

Defining the smoothing variable as the maximizing predecessor state

$$U(x(t+1)) := \arg\max_{x(t)} \{\Pr(x(t+1) \mid x(t)) \times V(x(t))\} \tag{9.28}$$

enables us to construct the entire state (sequence) estimation or equivalently the Viterbi
algorithm as [16]:

• Initialize:

$V(x(0)) = \Pr(x(0)) \times \Pr(y(0) \mid x(0))$

$U(x(0)) = 0$ for $x(0) = 1, \ldots, N_x$

• Recursion:

$V(x(t)) = \Pr(y(t) \mid x(t)) \times \max_{x(t-1)} \{\Pr(x(t) \mid x(t-1)) \times V(x(t-1))\}$

$U(x(t)) = \arg\max_{x(t-1)} \{\Pr(x(t) \mid x(t-1)) \times V(x(t-1))\}$ for $x(t) = 1, \ldots, N_x$;
$t = 1, \ldots, T$

• Termination:

$\hat{P} = \max_{x(T)} \{V(x(T))\}$

$\hat{x}(T \mid T) = \arg\max_{x(T)} \{V(x(T))\}$

• Smoothing: $\hat{x}(t \mid T) = U(\hat{x}(t+1 \mid T))$ for $t = T-1, T-2, \ldots, 0$
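
The four steps above can be sketched in Python as follows. This is an illustrative sketch, not the MATLAB routine used in the example below: the function names, the uniform prior and the short observation list are assumptions, and the exhaustive enumeration is included only to confirm that the recursion finds a maximum-probability path.

```python
from itertools import product


def path_probability(A, C, prior, obs, path):
    """Joint probability Pr(X_T, Y_T) of one state path and the observations."""
    p = prior[path[0]] * C[path[0]][obs[0]]
    for t in range(1, len(obs)):
        p *= A[path[t - 1]][path[t]] * C[path[t]][obs[t]]
    return p


def viterbi(A, C, prior, obs):
    """Return (best path, its joint probability) for a discrete HMM."""
    n = len(prior)
    # Initialize: V(x(0)) = Pr(x(0)) * Pr(y(0) | x(0))
    V = [prior[k] * C[k][obs[0]] for k in range(n)]
    backpointers = []   # one list U per time-step t >= 1
    for y in obs[1:]:
        # U[k]: best predecessor of state k; V[k]: probability of best path
        U = [max(range(n), key=lambda l: A[l][k] * V[l]) for k in range(n)]
        V = [C[k][y] * A[U[k]][k] * V[U[k]] for k in range(n)]
        backpointers.append(U)
    # Termination: best final state and its probability
    last = max(range(n), key=lambda k: V[k])
    p_best = V[last]
    # Smoothing: follow the back-pointers from T down to 0
    path = [last]
    for U in reversed(backpointers):
        path.append(U[path[-1]])
    path.reverse()
    return path, p_best


A = [[0.6, 0.4], [0.3, 0.7]]
C = [[0.50, 0.25, 0.25], [0.35, 0.25, 0.40]]
prior = [0.5, 0.5]            # assumed uniform initial state probability
obs = [0, 0, 2, 2, 1, 2, 0]   # assumed observations, symbols coded 0, 1, 2

path, p_best = viterbi(A, C, prior, obs)
p_enumerated = max(path_probability(A, C, prior, obs, cand)
                   for cand in product(range(2), repeat=len(obs)))
```

In practice the products are usually replaced by sums of logarithms to avoid numerical underflow on long sequences.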

The Viterbi algorithm uses these recursions and smoothing relations to estimate
the “optimal path” and requires the same order of operations as the forward algorithm
discussed in the previous section. It has proved to be an extremely popular and robust
algorithm to perform decoding. Consider the following example of a path estimate.

Example 9.4 

Using the discrete binary signal and observation of the previous example we would 
like to “decode” the entire message from the observations, that is, we wish to extract 
the “entire” coded sequence at each time-step using the Viterbi approach discussed 
above. 

We apply MATLAB (hmmviterbi) to perform the optimal entire state sequence 
(path) estimation using the Viterbi algorithm for the 100 observation samples. Using 
the synthesized output data and corresponding transition and observation probability 
matrices, as before we obtain the estimation results shown in Fig. 9.6. Here we can 
observe the path (dark solid line) which corresponds to the sequence estimation. 
Note that it is a simple path but contains most of the states and leads directly to the 
desired result. The matching capability of this approach is captured by estimating the 
percentage of the time that the actual sequence agrees with the estimated sequence. For this
simulation, the estimated sequence matches the actual sequence 52% of the time. △△△

This completes the state sequence estimation example; next we consider estimating
the model parameters.


9.6 PARAMETER ESTIMATION IN HMM: THE EM/BAUM-WELCH TECHNIQUE

The most challenging problem in HMM is the development of the model in the first 
place. Just as in dynamic (physical) systems theory [17, 18], the system identifi¬ 
cation/parameter estimation problem is still a highly researched problem especially 
for nonlinear systems. The basic estimation problem consists of two major issues: 
(1) estimation of the underlying internal structure (interconnections, state assign¬ 
ments, etc.); and (2) HMM parameter estimation consisting of the transition and 
observation probabilities and initial conditions assuming that the internal structure 
of (1) is known a priori.

FIGURE 9.6 HMM realization of a discrete two-state Markov chain sequence and the
results of the Viterbi path estimation with a 52% match (estimated-to-actual states).
Horizontal axis: sample no. (0 to 100).


For dynamic physical systems, the identification of the internal structural model 
usually evolves from first principles where relations governing the phenomenology 
are assembled. For non-physical systems, parametric models (with interconnections) 
are assumed (e.g., ARMA) and then the solution to the parameter estimation problem 
follows [19]. HMM are very similar from this perspective. Their structure is developed 
from an internal probabilistic representation that is application driven. For instance, 
the well-known problem in signal processing of recovering a transmitted random 
telegraph signal from noisy observations is representative. Here the problem is to 
“decode” the signal into a sequence of zeros or ones with the probability distribution 
of the transitions (zero-to-one) assumed known (Poisson). A HMM can be structurally
developed to model this problem quite easily [2]. Another example is the decoding
of DNA strings for forensic analysis. Here the same modeling principle applies to
develop the internal structural model [5]. In any case developing the internal model is
usually the task of the phenomenologist, whether a physical or non-physical system,
and the next step is to “fit” the parameters of this structure to the model, which is the
primary focus of this section. Thus, we discuss the development of the parameter
estimation techniques applied to estimate the parameters of a HMM. Here we assume the
number of states and measurements are known along with the internal structure and
the problem becomes a matter of “fitting” these well-defined parameters (transition
probabilities, observation probabilities and initial conditions) to the known model
internal structure.

Parameter estimation for HMM has been a difficult and challenging problem,
especially in an on-line environment [1]. The original efforts of Baum-Welch [20] have
led to the general expectation-maximization (EM) algorithm (see Chapter 2) which
is a very powerful iterative approach using likelihood estimation techniques to solve
this problem [21, 22]. Here we briefly outline the iterative approach and then show
how the algorithm is a special case of EM.

The basic HMM parameter estimation problem is (see Fig. 9.3c): 

GIVEN a set or $J$-sets⁴ of observation sequences, $\{Y_T(j)\};\ j = 1, \ldots, J$ along with the
underlying HMM internal structure $\Sigma$, FIND the “best” (MAP) estimate of the
underlying parameters, $\hat{\Theta}_{\mathrm{MAP}} := \{a_{mn}, c_{kn}, P(0)\}$ maximizing the posterior distribution,
$\Pr(\Theta \mid Y_T)$.

We will discuss this problem in two parts: (1) state sequence is known a priori;
and (2) state sequence is unknown [5].

9.6.1 Parameter Estimation with State Sequence Known 

When the state sequence is “known” a priori, then the parameter estimation problem
is much simpler, as is the case in physical systems for the design of optimal inputs for
system identification [18].

We define the following posterior probability

$$\mathcal{V}_{mn}(t, t+1) := \Pr(x_m(t), x_n(t+1) \mid Y_T, \Theta)$$

This posterior is the probability of the joint event that a path (state sequence) passes
through state $m$ at time-step $t$ and through state $n$ at $t + 1$ and the HMM generates a
sequence of observations $Y_T$ given the model parameters, $\Theta$.

To analyze this probability further, we apply Bayes’ rule and then partition the data
as before

$$\mathcal{V}_{mn}(t, t+1) = \frac{\Pr(x_m(t), x_n(t+1), Y_T \mid \Theta)}{\Pr(Y_T \mid \Theta)} = \frac{\Pr(x_m(t), x_n(t+1), Y_t, Y_{t+1:T} \mid \Theta)}{\Pr(Y_T \mid \Theta)}$$

now applying Bayes’ rule to the numerator results in

$$\Pr(x_m(t), x_n(t+1), Y_t, Y_{t+1:T} \mid \Theta) = \Pr(x_n(t+1), Y_{t+1:T} \mid Y_t, x_m(t), \Theta) \times \Pr(Y_t, x_m(t) \mid \Theta) \tag{9.30}$$


⁴ These sets are called training sets, a term that evolves from the classification/neural-net technical area.

The last term is just $\mathcal{F}_m(t)$, the forward operator (with the parameter set $\Theta$ given).
Now concentrating on the remaining term of this expression, we extract the $y(t+1)$
from the data term and apply Bayes’ rule again to obtain

$$\Pr(x_n(t+1), y(t+1), Y_{t+2:T} \mid Y_t, x_m(t), \Theta) = \Pr(Y_{t+2:T} \mid x_n(t+1), y(t+1), Y_t, x_m(t), \Theta) \times \Pr(x_n(t+1), y(t+1) \mid Y_t, x_m(t), \Theta) \tag{9.31}$$

where the last term above decomposes further to

$$\Pr(x_n(t+1), y(t+1) \mid Y_t, x_m(t), \Theta) = \Pr(y(t+1) \mid x_n(t+1), Y_t, x_m(t), \Theta) \times \Pr(x_n(t+1) \mid x_m(t), Y_t, \Theta) \tag{9.32}$$

enabling us to simplify each of these terms individually to yield:

$$\Pr(Y_{t+2:T} \mid x_n(t+1), y(t+1), Y_t, x_m(t), \Theta) \rightarrow \Pr(Y_{t+2:T} \mid x_n(t+1), \Theta)$$
$$\Pr(x_n(t+1), y(t+1) \mid Y_t, x_m(t), \Theta) \rightarrow \Pr(y(t+1) \mid x_n(t+1), \Theta) \times \Pr(x_n(t+1) \mid x_m(t), \Theta) \tag{9.33}$$

and therefore we obtain the expression:

$$\mathcal{V}_{mn}(t, t+1) = \Pr(Y_t, x_m(t) \mid \Theta) \times \Pr(x_n(t+1) \mid x_m(t), \Theta) \times \Pr(y(t+1) \mid x_n(t+1), \Theta) \times \Pr(Y_{t+2:T} \mid x_n(t+1), \Theta) / \Pr(Y_T \mid \Theta)$$

Finally, substituting for the known parameters, we have the desired result

$$\mathcal{V}_{mn}(t, t+1) = \frac{\mathcal{F}_m(t) \times a_{mn} \times c_{kn} \times \mathcal{B}_n(t+1)}{\Pr(Y_T \mid \Theta)} \tag{9.34}$$

where $\mathcal{F}_m(t)$ encompasses the past history ending at time $t$ and state $m$ while $\mathcal{B}_n(t+1)$
accounts for the path’s future which at time $t + 1$ is at state $n$ evolving until the end.
The product term ($a_{mn} \times c_{kn}$) takes into account the current activity at $t$ with discrete
observation $y_k(t+1) \rightarrow y(t+1)$.

With this term determined, we then define the posterior distribution from
Eq. 9.23 as:

$$\mathcal{O}_m(t) := \Pr(x_m(t) \mid Y_T) \tag{9.35}$$

which is a probability of the joint event that a path passes through state $m$ at time-step $t$
and the HMM generates a sequence of observations $Y_T$ given the model parameters, $\Theta$.

Note that both probabilities are related, since one can be obtained through
marginalization of the other

$$\mathcal{O}_m(t) = \sum_{n=1}^{N_x} \mathcal{V}_{mn}(t, t+1) \tag{9.36}$$

Summing both these quantities “across time” enables us to obtain the expected number
of times in state $m$ and the expected number of transitions away from state $m$ for $Y_T$
(see [15] for more details)

$$\sum_{t=1}^{T-1} \mathcal{O}_m(t) \tag{9.37}$$

Similarly, the expected number of transitions from state $m$ to state $n$ for $Y_T$ is given by

$$\sum_{t=1}^{T-1} \mathcal{V}_{mn}(t, t+1) \tag{9.38}$$

Thus, using these expectations and counting estimation of probabilities [5], we are
able to obtain the Baum-Welch estimates:

$$\hat{P}_m(0) = \mathcal{O}_m(1), \qquad \hat{a}_{mn} = \frac{\sum_{t=1}^{T-1} \mathcal{V}_{mn}(t, t+1)}{\sum_{t=1}^{T-1} \mathcal{O}_m(t)}, \qquad \hat{c}_{kn} = \frac{\sum_{t=1;\ \text{such that } y(t) = y_k}^{T} \mathcal{O}_n(t)}{\sum_{t=1}^{T} \mathcal{O}_n(t)} \tag{9.39}$$

where

$\hat{P}_m(0)$ is the expected number of times in state $x_m(t)$ at $t = 1$;
$\hat{a}_{mn}$ is the expected number of transitions from state $x_m(t)$ to state $x_n(t)$ over
the expected number of transitions from state $x_m(t)$; and
$\hat{c}_{kn}$ is the expected number of times in state $x_n(t)$ and observing $y_k(t)$ over
the expected number of times in state $x_n(t)$.


Thus, when all of the paths are known (this case), then it is possible to count the
number of times each particular transition or output observation is applied in a set of
training data. It has been shown that counting functions, say $N_{mn}(x(t))$ for the state
transitions and $N_{kn}(y(t))$ for the output observations, provide maximum likelihood
estimates for the desired model parameters [5], such that

$$\hat{a}_{mn} = \frac{N_{mn}(x(t))}{\sum_{n} N_{mn}(x(t))} \quad \text{and} \quad \hat{c}_{kn} = \frac{N_{kn}(y(t))}{\sum_{k} N_{kn}(y(t))} \tag{9.40}$$


Next we consider the unknown path case and combine the above results to establish 
the algorithm. 
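
When the path is known, the counting estimates of Eq. 9.40 reduce to normalized tallies. The sketch below is a minimal Python illustration; the function name, the toy sequences and the uniform fallback for a state that is never visited are assumptions made for this example.

```python
def estimate_from_known_path(states, observations, n_states, n_symbols):
    """Counting (maximum likelihood) estimates with a known state sequence.

    Returns (A_hat, C_hat) with A_hat[m][n] the fraction of transitions
    m -> n and C_hat[n][k] the fraction of visits to state n that emit
    symbol k. States and symbols are coded from 0.
    """
    n_trans = [[0] * n_states for _ in range(n_states)]
    n_emit = [[0] * n_symbols for _ in range(n_states)]
    # Count transitions between consecutive states.
    for m, n in zip(states, states[1:]):
        n_trans[m][n] += 1
    # Count symbols emitted while in each state.
    for n, k in zip(states, observations):
        n_emit[n][k] += 1

    def normalize(row):
        total = sum(row)
        if total == 0:
            # Assumption: fall back to a uniform row for an unvisited state.
            return [1.0 / len(row)] * len(row)
        return [count / total for count in row]

    return [normalize(r) for r in n_trans], [normalize(r) for r in n_emit]


# Hypothetical short training pair (state path and its observations).
states = [0, 0, 1, 1, 1, 0, 0, 1]
observations = [0, 1, 2, 2, 0, 0, 1, 2]

A_hat, C_hat = estimate_from_known_path(states, observations, 2, 3)
```

With so few samples the estimates are coarse; longer or multiple training sequences simply add to the same tallies before normalizing.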


9.6.2 Parameter Estimation with State Sequence Unknown 

In this section we consider the case where the state sequence is “not known” [5]
and must be determined using the current parameter estimates available. When the
paths are unknown for training sequences, a closed-form equation is nonexistent
and therefore some type of iterative approach must be applied (e.g., EM [22-30]).
The EM/Baum-Welch approach precisely solves the HMM parameter estimation
problem in an iterative manner. It first estimates the counting functions, $N_{mn}(x(t))$ for
states and $N_{kn}(y(t))$ for observations, by considering possible paths for the training
sequences using current model parameters $\Theta$ and then calculates the new estimates
using Eq. 9.40. The algorithm continues to iterate until the log-likelihood function,
$\ln \Pr(Y_T \mid \Theta)$, no longer increases with each iteration. Baum [20] has shown that the
overall log-likelihood increases with each iteration indicating convergence to a local
maximum.

More precisely, this technique estimates the counting functions, $N_{mn}(x(t))$ and
$N_{kn}(y(t))$, as the expected number of times each transition or output is utilized from
the given training sequences. It also uses the identical forward/backward operators as
before (see Section 9.5) using the posterior probability $\mathcal{V}_{mn}(t, t+1)$ of Eq. 9.34. From
this relation, we can derive the expected number of times that $a_{mn}$ is used by summing
over all possible positions and over all training sequences $j = 1, \ldots, J$. We can
also use the training sequences to derive the expected number of times the observation
occurs to obtain:

$$N_{mn}(x(t)) = \sum_{j} \frac{1}{\Pr(Y_T(j) \mid \Theta)} \sum_{t} \mathcal{F}_m(t, j) \times a_{mn} \times c_{kn} \times \mathcal{B}_n(t+1, j)$$
$$N_{kn}(y(t)) = \sum_{j} \frac{1}{\Pr(Y_T(j) \mid \Theta)} \sum_{t:\ y(t) = y_k} \mathcal{F}_n(t, j) \times \mathcal{B}_n(t, j) \tag{9.41}$$

Once these expectations are estimated, the model parameters are updated as above
and these new estimates are used in the counting functions. We summarize the
EM/Baum-Welch algorithm as:

• Initialization: $\Theta_0$, $N_{mn}(x(t))$ and $N_{kn}(y(t))$;

• Forward/Backward Recursions: $\mathcal{F}_k(t, j)$ and $\mathcal{B}_k(t, j)$ of Eqs. 9.12 and 9.20;

• Counting functions: $N_{mn}(x(t))$ and $N_{kn}(y(t))$ of Eq. 9.41;

• Parameter Estimation: $\hat{\Theta} = \{\hat{a}_{mn}, \hat{c}_{kn}, \hat{P}(x(0))\}$ of Eq. 9.40;

• Likelihood: $\Pr(Y_T \mid \hat{\Theta})$; and

• Termination: stop when the increase in $\ln \Pr(Y_T \mid \hat{\Theta})$ between iterations falls below $\tau$, for $\tau$ a threshold.

This completes the algorithm. It should be noted that an alternative approach
to searching over all paths is to use the Viterbi paths providing the most probable
paths for all of the training sequences. However, this approach does not maximize
the true likelihood. It is known that Viterbi training does not perform as well as the
Baum-Welch, but it is still popular when applied to decoding problems.

There are also a number of techniques that practitioners use to enhance numerical
performance and convergence of this technique. These include: (1) logarithmic
transformation of the product probabilities to create sums [5]; (2) scaling both forward
and backward operators [10]; (3) initial conditions; and (4) training data issues [23].
In closing, we note that the Baum-Welch algorithm is just a special case of the EM
algorithm of Sec. 2.3. That is, the E-step of the EM algorithm is given by [15]:
E-step:

$$Q(\Theta, \Theta_{i-1}) = \sum_{x} \ln \Pr(x \mid Y_T, \Theta) \times \Pr(x \mid Y_T, \Theta_{i-1})$$
$$= \sum_{x} \ln P(x(0)) \Pr(Y_T, x(t) \mid \Theta) + \sum_{x} \Big( \sum_{t} \ln a_{mn}(t-1, t) \Big) \times \Pr(Y_T, x(t) \mid \Theta) + \sum_{x} \Big( \sum_{t} \ln c_{kn}(t, t) \Big) \times \Pr(Y_T, x(t) \mid \Theta) \tag{9.42}$$

where $\Pr(Y_T, x(t) \mid \Theta) = P(x(0)) \prod_{t} c_{kn}(t, t) \times a_{mn}(t, t+1)$. Optimizing these
terms leads precisely to the expressions in Eq. 9.39 (see [10] or [15] for details).
Thus, the E-step of the EM consists of estimating the required expectations using the
forward/backward recursions which completely determines $Q(\Theta, \Theta_{i-1})$ and the
maximum (M-step) consists of substituting these terms into the corresponding likelihood.
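
The iteration can be sketched for a single training sequence as follows. This is a minimal, unscaled Python illustration under stated assumptions (hypothetical function names, a hand-written observation list, arbitrary starting values, states and symbols coded from 0); it is not the MATLAB routine used in the example below and omits the multiple-sequence sum of Eq. 9.41.

```python
import math


def forward(A, C, prior, obs):
    n = len(prior)
    F = [[prior[k] * C[k][obs[0]] for k in range(n)]]
    for y in obs[1:]:
        prev = F[-1]
        F.append([C[k][y] * sum(A[l][k] * prev[l] for l in range(n))
                  for k in range(n)])
    return F


def backward(A, C, obs):
    n, T = len(A), len(obs)
    B = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        B[t] = [sum(A[k][l] * C[l][obs[t + 1]] * B[t + 1][l] for l in range(n))
                for k in range(n)]
    return B


def baum_welch_step(A, C, prior, obs):
    """One EM update; also returns ln Pr(Y_T) under the *input* parameters."""
    n, n_sym, T = len(A), len(C[0]), len(obs)
    F, B = forward(A, C, prior, obs), backward(A, C, obs)
    like = sum(F[-1])
    # State posterior (Eq. 9.35) and transition posterior (Eq. 9.34).
    gamma = [[F[t][m] * B[t][m] / like for m in range(n)] for t in range(T)]
    xi = [[[F[t][m] * A[m][k] * C[k][obs[t + 1]] * B[t + 1][k] / like
            for k in range(n)] for m in range(n)] for t in range(T - 1)]
    new_prior = gamma[0][:]
    new_A = [[sum(xi[t][m][k] for t in range(T - 1)) /
              sum(gamma[t][m] for t in range(T - 1))
              for k in range(n)] for m in range(n)]
    new_C = [[sum(gamma[t][m] for t in range(T) if obs[t] == s) /
              sum(gamma[t][m] for t in range(T))
              for s in range(n_sym)] for m in range(n)]
    return new_A, new_C, new_prior, math.log(like)


def baum_welch(A, C, prior, obs, max_iter=50, tol=1e-6):
    history = []   # log-likelihood of the parameters entering each iteration
    for _ in range(max_iter):
        A, C, prior, loglike = baum_welch_step(A, C, prior, obs)
        history.append(loglike)
        if len(history) > 1 and history[-1] - history[-2] < tol:
            break
    return A, C, prior, history


obs = [0, 0, 1, 2, 2, 2, 1, 0, 0, 0, 2, 2, 1, 2, 2,
       0, 0, 1, 0, 2, 2, 2, 2, 1, 0, 0, 0, 1, 2, 2]   # assumed data
A0 = [[0.7, 0.3], [0.4, 0.6]]                          # assumed starting values
C0 = [[0.6, 0.2, 0.2], [0.2, 0.3, 0.5]]
A_hat, C_hat, prior_hat, history = baum_welch(A0, C0, [0.5, 0.5], obs)
```

The list `history` should be non-decreasing, which is the convergence property attributed to Baum above; for realistic sequence lengths the forward and backward operators would need scaling or a logarithmic formulation.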


Example 9.5 

Again using the discrete binary signal and the discrete observation of the previous
examples, we perform the parameter estimation of the transition and observation
probabilities, first using the “known” (actual) state sequence and then generating an
ensemble of training sequences ($N = 25$) to perform the EM/Baum-Welch algorithm
(hmmtrain) with a maximum of 500 iterations and an error tolerance of $1 \times 10^{-4}$.
Here we also use the MATLAB maximum likelihood estimation with the “known” state
sequence (hmmestimate) as well to compare performance. The resulting parameter
estimates and percentage errors are:


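EM/BAUM-WELCH PARAMETER ESTIMATES

$$A_{\mathrm{true}} = \begin{bmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{bmatrix}; \quad \hat{A}_{\mathrm{BW}} = \begin{bmatrix} 0.62 & 0.38 \\ 0.30 & 0.70 \end{bmatrix}; \quad A_{\%\mathrm{ERR}} = \begin{bmatrix} 3 & 4 \\ 0 & 0 \end{bmatrix}$$

$$C_{\mathrm{true}} = \begin{bmatrix} 0.50 & 0.25 & 0.25 \\ 0.35 & 0.25 & 0.40 \end{bmatrix}; \quad \hat{C}_{\mathrm{BW}} = \begin{bmatrix} 0.51 & 0.29 & 0.19 \\ 0.34 & 0.22 & 0.44 \end{bmatrix}; \quad C_{\%\mathrm{ERR}} = \begin{bmatrix} 3 & 18 & 23 \\ 3 & 13 & 11 \end{bmatrix}$$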

Thus, the parameter estimates are quite reasonable under these conditions. Note 
the initial probability matrices are automatically established by this implementation 
in MATLAB ; however, it is possible to alter them if desired. 


MAXIMUM LIKELIHOOD PARAMETER ESTIMATES

$$A_{\mathrm{true}} = \begin{bmatrix} 0.6 & 0.4 \\ 0.3 & 0.7 \end{bmatrix}; \quad \hat{A}_{\mathrm{ML}} = \begin{bmatrix} 0.67 & 0.33 \\ 0.36 & 0.64 \end{bmatrix}; \quad A_{\%\mathrm{ERR}} = \begin{bmatrix} 12 & 18 \\ 21 & 9 \end{bmatrix}$$

$$C_{\mathrm{true}} = \begin{bmatrix} 0.50 & 0.25 & 0.25 \\ 0.35 & 0.25 & 0.40 \end{bmatrix}; \quad \hat{C}_{\mathrm{ML}} = \begin{bmatrix} 0.45 & 0.23 & 0.32 \\ 0.47 & 0.19 & 0.34 \end{bmatrix}; \quad C_{\%\mathrm{ERR}} = \begin{bmatrix} 9 & 9 & 28 \\ 34 & 23 & 15 \end{bmatrix}$$

Again the estimates appear quite reasonable. Longer sequences can be employed
to improve the estimates even further. We see that there is a distinct advantage when
the state sequence is known a priori, because the training sequences are not required.
The Viterbi initialization was also executed on this data, but it did not perform nearly
as well as the Baum-Welch technique. △△△



This completes the section; next we consider a case study.


9.7 CASE STUDY: TIME-REVERSAL DECODING 

In this section, we consider applying the Viterbi algorithm to decode a message
transmitted through a hostile environment with reverberations along with the processor
and decoding algorithm. Acoustic time-reversal (T/R) communications is an application
area motivated by the recent theoretical advances in T/R theory [30]. Although
perceived by many in signal processing as simply an application of matched-filter
theory, a T/R receiver offers an interesting solution to the communications problem
for a highly reverberant channel. This case study briefly describes an acoustic
communications experiment of data gathered in air and its associated signal processing. The
experiment is developed to evaluate the performance of a point-to-point T/R receiver
designed to extract a transmitted coded information sequence propagating in a
hostile, highly reverberant environment. These results are merely used to “synthesize” a
HMM based on the raw/quantized acoustic measurements and then used to extract the
transition and observation probabilities for simulation and evaluation. Even though
this case study is based on real data, it is only chosen to illustrate the application of
HMM techniques after data is simulated through the HMM process (evaluation).

From a signal processing perspective, T/R processing appears to be an application
of matched-filtering in which the output signal-to-noise ratio (SNR) is maximized.
This T/R replicant is then cross-correlated with the noisy received signal to
produce the optimal filtered output [31]. However, it becomes more complicated in
the spatio-temporal case in which the optimal matched-filter must not only match
the transmitted temporal function, but also the corresponding spatio-temporal channel
medium impulse response or so-called Green’s function. It has been shown that
time-reversal techniques are applicable to spatio-temporal phenomena that satisfy
a wave-type equation possessing the time reversal invariance property [30]. Thus,
time-reversal is the dynamic broadband analog of the well-known phase conjugate
mirror used to focus narrowband monochromatic waves. It represents the “optimal”
spatio-temporal matched filter in the sense of maximizing the output SNR. It is
essentially a technique which can be used to “remove” the aberrations created by an
inhomogeneous or random channel. In communications, the T/R receiver can
overcome the inherent noise created by the medium providing the enhancement required to
extract the transmitted information sequence. Here we ignore the array aspects of T/R
by considering only point-to-point communications. In this case study the realization
of a T/R receiver is briefly discussed and applied to a noisy microphone
measurement in a hostile environment. It is then used to estimate the required transition and
observation matrices for eventual synthesis/analysis.

For time-reversal, the matched-filter in additive white noise is identical to that
posed above with a “known” Green’s function of the medium replacing the known
signal replicant [31]. The Green’s function, $g(r, r_0; t)$, is the result of a point-to-point
communication link between a station (source) at $r_0$ to a master station (receiver)
at $r$. In this case, the matched-filter solution is again found by maximizing $\mathrm{SNR}_{out}$,
leading to the solution that is satisfied with equality at some time $T$. If the resulting
filter response is $f(t)$, then the solution is given by

$$f(t) = g(r, r_0; T - t) \tag{9.43}$$

Thus, for T/R, the optimal matched-filter solution is the time-reversed Green’s
function from the link station-to-master station (source-to-receiver) or vice versa.
Comparing these results with the standard matched-filter solution found in the
literature, the Green’s function of the channel is time-reversed rather than the transmitted
replicant signal as in radar or sonar. Note that since T/R theory requires
reciprocity [30], the result of Eq. 9.43 is valid for both transmission and reception,
that is, $g(r, r_0; T - t) = g(r_0, r; T - t)$. Note also that if an array is included to sample
the spatial field or transmit a wave, then these results include the focus at link station
(source) position, $r_0$, yielding the optimal, spatio-temporal matched-filter solution,
$g(r_\ell, r_0; T - t)$ at sensor position, $r_\ell$.

So we see that in transmitting a coded signal (state sequence) through a disruptive 
medium, the distorting effects can be mitigated by time-reversing the estimated medium 
Green’s function and creating an effective receiver. The details of this mechanism 
are discussed in [31-33] and are beyond the scope of this case study. Here we just 
describe one of the variety of receiver types that can be used once the Green’s 
function is estimated from pilot signals transmitted from transmitter (speaker) to 
receiver (microphone), producing $g(r, r_0; T - t)$. 

With the estimated Green’s function or impulse response available, we choose 
to apply time-reversal processing on reception [31] to the noisy received data, 
$z(t) = g(r; t) * i(t)$, with $i(t)$ the coded information (state) sequence. On reception, the 
estimated Green’s function is reversed and convolved with the receiver input to give 

$R(t) = z(t) * \hat{g}(r; -t) = g(r; t) * i(t) * \hat{g}(r; -t) = C_{g\hat{g}}(t) * i(t)$ (9.44) 

where $C_{g\hat{g}}$ is the estimated autocorrelation function of the medium possessing all of 
the reflection and scattering information, but modified for code signal enhancement. 
We show a typical T/R receiver output that was used to “synthesize” a discrete state 
and output sequence for transition and observation probability estimates. We show the 
receiver structure in Fig. 9.7a, where the time-reversed Green’s function is convolved 
with the received data and then quantized to recover the code. Actual T/R processed 
data are shown in Fig. 9.7b along with the quantized state and observation sequences extracted for 
illustrative purposes and eventual application of the HMM techniques. 

FIGURE 9.7 T/R processor acoustic microphone data: (a) T/R receiver structure. (b) Raw 
measurement data, synthesized observation and state data input for HMM parameter 
estimation. 
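The receiver of Eq. 9.44 (convolve the received data with the time-reversed Green’s function, then quantize) can be sketched in a few lines. This is only an illustration under stated assumptions: the 4-tap channel `g`, the binary `code`, the symbol `spacing`, and the half-peak threshold are hypothetical choices and are not taken from the microphone data of the case study. The channel is also assumed to be known exactly and noise-free.

```python
def convolve(a, b):
    """Full linear convolution of two finite sequences."""
    out = [0.0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out


def tr_receive(z, g_hat):
    """Convolve received data with the time-reversed Green's function estimate."""
    return convolve(z, g_hat[::-1])


# Hypothetical 4-tap channel response and binary code (not from the case study).
g = [1.0, 0.5, -0.3, 0.2]
code = [1, 0, 1, 1, 0]
spacing = 6  # samples between code symbols, chosen longer than the channel

# Place one impulse per symbol, then pass the sequence through the channel.
i_seq = [0.0] * (len(code) * spacing)
for n, bit in enumerate(code):
    i_seq[n * spacing] = float(bit)
z = convolve(g, i_seq)

# T/R reception: the output is the channel autocorrelation convolved with i(t).
r = tr_receive(z, g)

# Sample at the autocorrelation peak of each symbol and quantize.
peak = sum(v * v for v in g)   # zero-lag autocorrelation value
delay = len(g) - 1             # index of the zero-lag term
decoded = [1 if r[delay + n * spacing] > 0.5 * peak else 0
           for n in range(len(code))]
```

Because the symbol spacing exceeds the channel length, the autocorrelation lobes of neighboring symbols do not overlap at the sampling instants, which is why a simple threshold suffices in this toy setting.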

The results of processing these quantized sequences using the EM/Baum-Welch 
algorithm are: 

EM/BAUM-WELCH PARAMETER ESTIMATES 

$A_{bw} = \begin{bmatrix} 0.42 & 0.58 \\ 0.28 & 0.72 \end{bmatrix}; \quad C_{bw} = \begin{bmatrix} 1 & 0 \\ 0.22 & 0.78 \end{bmatrix}$ 


Next we used these extracted probability matrices to synthesize “realistic” T/R data 
for HMM processing; the results are shown in Fig. 9.8. With this available we proceed 
as before and estimate the individual states as depicted by the posterior probabilities in 
Fig. 9.9. These results are quite reasonable, as can be observed by comparing samples 
with the aligned probability functions at each time-step. Next the entire state sequence 
was estimated using the “most likely” Viterbi approach and the results are illustrated 
in Fig. 9.10, where the actual (synthesized time-reversed) state sequence is 
shown with the Viterbi result superimposed. The agreement is quite good with 
an 85% matching (in-time) of the actual with the estimated states. 

FIGURE 9.8 (a) Hidden state sequence realization. (b) Observation realization. 
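The synthesis, posterior estimation and Viterbi steps just described can be sketched as follows. The matrices are the estimates quoted in this case study; everything else is an assumption made for illustration: the uniform initial state probabilities `PI`, the random seed, and the zero-based state labels (state 0 and state 1 here stand for State 1 and State 2 of the text). The sketch is not the MATLAB implementation used to produce the figures.

```python
import math
import random

A = [[0.42, 0.58], [0.28, 0.72]]   # A[i][j] = Pr(x(t+1) = j | x(t) = i)
C = [[1.00, 0.00], [0.22, 0.78]]   # C[i][k] = Pr(y(t) = k | x(t) = i)
PI = [0.5, 0.5]                    # assumed initial state probabilities


def simulate(n, rng):
    """Draw a hidden state sequence and the corresponding observations."""
    states, obs = [], []
    x = rng.choices([0, 1], weights=PI)[0]
    for _ in range(n):
        states.append(x)
        obs.append(rng.choices([0, 1], weights=C[x])[0])
        x = rng.choices([0, 1], weights=A[x])[0]
    return states, obs


def forward_filter(obs):
    """Filtered posteriors Pr(x_i(t) | Y_t) by recursive predict/update."""
    post, prior = [], PI
    for y in obs:
        unnorm = [prior[i] * C[i][y] for i in range(2)]
        total = sum(unnorm)
        cur = [u / total for u in unnorm]
        post.append(cur)
        prior = [sum(cur[i] * A[i][j] for i in range(2)) for j in range(2)]
    return post


def safe_log(p):
    return math.log(p) if p > 0.0 else float("-inf")


def viterbi(obs):
    """Most likely entire state path (log-domain dynamic programming)."""
    delta = [safe_log(PI[i]) + safe_log(C[i][obs[0]]) for i in range(2)]
    back = []
    for y in obs[1:]:
        step, new = [], []
        for j in range(2):
            best = max(range(2), key=lambda i: delta[i] + safe_log(A[i][j]))
            step.append(best)
            new.append(delta[best] + safe_log(A[best][j]) + safe_log(C[j][y]))
        back.append(step)
        delta = new
    path = [max(range(2), key=lambda i: delta[i])]
    for step in reversed(back):
        path.append(step[path[-1]])
    return path[::-1]


rng = random.Random(0)
states, obs = simulate(100, rng)
post = forward_filter(obs)
path = viterbi(obs)
match = sum(s == p for s, p in zip(states, path)) / len(states)
```

The fraction `match` plays the role of the in-time matching percentage quoted above; its value depends on the seed and on the assumed initial probabilities, so it should not be expected to reproduce the figure in the text.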

With these synthesized probability matrices, we performed the parameter estimation 
approach as before to give 

EM/BAUM-WELCH PARAMETER ESTIMATES 

$A_{tru} = \begin{bmatrix} 0.42 & 0.58 \\ 0.28 & 0.72 \end{bmatrix}; \quad C_{tru} = \begin{bmatrix} 1 & 0 \\ 0.22 & 0.78 \end{bmatrix}$ 

$A_{bw} = \begin{bmatrix} 0.45 & 0.55 \\ 0.27 & 0.73 \end{bmatrix}; \quad C_{bw} = \begin{bmatrix} 1 & 0 \\ 0.24 & 0.76 \end{bmatrix}$ 

$A_{\%err} = [\,\cdots\,]; \quad C_{\%err} = [\,\cdots\,]$ 

FIGURE 9.9 HMM estimation of the T/R discrete two-state Markov chain and observation: 
(a) Hidden state sequence realization. (b) State 1 posterior probability estimate, 
$\Pr(x_1(t) \mid Y_t)$. (c) State 2 posterior probability estimate, $\Pr(x_2(t) \mid Y_t)$. 


Thus, the parameter estimates are quite reasonable under these conditions. Note 
the initial probability matrices are automatically established by this implementation 
in MATLAB; however, it is possible to alter them if desired. 


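MAXIMUM LIKELIHOOD PARAMETER ESTIMATES 

$A_{tru} = \begin{bmatrix} 0.42 & 0.58 \\ 0.28 & 0.72 \end{bmatrix}; \quad A_{ml} = \begin{bmatrix} 0.42 & 0.58 \\ 0.29 & 0.71 \end{bmatrix}; \quad C_{ml} = \begin{bmatrix} 1 & 0 \\ 0.22 & 0.78 \end{bmatrix}$ 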


The estimates are quite reasonable. We see that there is a distinct advantage when 
the state sequence is known a priori, since the training sequences are then not required. 
Again, the Viterbi initialization was also executed on these data, but it did not perform 
very well. This completes the case study. 

FIGURE 9.10 Entire state sequence estimation using the Viterbi algorithm illustrating 
the most likely path with an 82% match between the actual state sequence and the 
estimated. 
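When the state sequence is known a priori, the maximum likelihood estimates reduce to normalized counts of transitions and emissions. The following is a minimal sketch of that counting procedure, assuming zero-based integer labels for states and symbols; the function name `ml_estimate` and the short sequences are hypothetical and are not the case-study data.

```python
def ml_estimate(states, obs, n_states=2, n_symbols=2):
    """Count-based ML estimates of the transition and observation matrices."""
    trans = [[0] * n_states for _ in range(n_states)]
    emit = [[0] * n_symbols for _ in range(n_states)]

    # Count state-to-state transitions.
    for a, b in zip(states, states[1:]):
        trans[a][b] += 1
    # Count symbols emitted from each state.
    for x, y in zip(states, obs):
        emit[x][y] += 1

    def normalize(rows):
        # Rows with no counts are left as zeros rather than guessed.
        return [[c / sum(r) if sum(r) else 0.0 for c in r] for r in rows]

    return normalize(trans), normalize(emit)


# Hypothetical short sequences, for illustration only.
A_ml, C_ml = ml_estimate([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
```

No iteration or initial guess is needed here, which is the advantage noted above; Baum-Welch is required only when the states are hidden.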


9.8 SUMMARY 

In this chapter we have introduced the concept of hidden Markov models and illustrated 
their internal characteristics through a state-space representation. We first 
developed the concepts of Markov and hidden Markov chains and showed how 
they were related. Next, we investigated properties of the HMM illustrating how 
the Bayesian concepts easily transfer over to this discrete representation. We next 
investigated the three fundamental problems along with some variations: (1) the evaluation 
(simulation) problem; (2) the state estimation problem; and (3) the parameter 
estimation problem. A careful analysis of each led us to the popular Viterbi decoding 
technique and the specialized EM algorithm popularly called the Baum-Welch technique. 
We concluded with a case study to decode a transmitted coded sequence from 
data enhanced by a time-reversal processor. 

MATLAB NOTES 

MATLAB has a Statistics Toolbox that incorporates the capability to develop and 
process hidden Markov models (HMM) along with demonstrations and a tutorial. 
hmmgenerate synthesizes a sequence for an HMM, while the command hmmdecode 
calculates the posterior state probabilities of a given sequence (Sec. 9.3). 
The most likely (entire) state path can be estimated using the Viterbi algorithm 
of Sec. 9.5 (hmmviterbi). When the state sequence is known, the HMM 
parameter estimation problem (Sec. 9.6) can be solved by maximum likelihood using 
the hmmestimate command, while the EM/Baum-Welch technique is used to estimate 
the model parameters from the observations alone using the hmmtrain command. The 
PDF estimators include the usual histogram (hist) as well as the sophisticated kernel 
density estimator (ksdensity) offering a variety of kernel (window) functions 
(Gaussian, etc.). 

There also exists NETLAB, which is a free MATLAB software package that 
includes HMM algorithms (see [24] for details and website), as well as the 
MATLAB-based software package (free) called the “HMM toolbox” by K. 
Murphy (http://www.cs.ubc.ca/~murphyk/Software/HMM/hmm_usage.html), which 
can be downloaded for use. The EM algorithm is well documented [20-22,25-29] 
and an integral part of each of these packages. 

PROBLEMS 

9.1 Suppose we have a noisy binary channel with input (symbol) $x$ and output 
(symbol) $y$ such that $x \in \mathcal{X} = \{0, 1\}$ and $y \in \mathcal{Y} = \{0, 1\}$. The transition probability 
matrix is: 

$\begin{bmatrix} \Pr(y = 0 \mid x = 0) & \Pr(y = 0 \mid x = 1) \\ \Pr(y = 1 \mid x = 0) & \Pr(y = 1 \mid x = 1) \end{bmatrix} = \begin{bmatrix} 0.85 & 0.15 \\ 0.15 & 0.85 \end{bmatrix}$ 

Assume we observe the symbol $y = 1$ and $x \sim \Pr(x(0)) = [0.9 \ \ 0.1]'$, then 

(a) What is the posterior probability of $x$? 

(b) Let $x = 1$, what is the posterior if this is true? 

(c) Let $x = 0$, what is the posterior if this is true? 

(d) Suppose $y = 0$, what are the corresponding posteriors of $x$? 

9.2 A fly [4] moves along a straight line in unit increments. At each time period 
it moves one unit to the left with probability 0.3, one unit to the right with 
probability 0.3 and stays in place with probability 0.4. A spider is hiding at 
positions 1 and $m$: if the fly lands there, it is captured by the spider and the 
process ends. 

(a) Construct a Markov chain model, assuming the fly starts at positions 
$2, \ldots, m - 1$. 

(b) Sketch the corresponding directed graph. 

(c) What is the probability of the following state evolution sequence: 
$\Pr(X_1 = 2, X_3 = 3, X_4 = 4 \mid X_2)$? 

9.3 Consider placing a ball in one of $N$ compartments at each event. Each compartment 
can hold multiple balls. Let $x_k$; $k = 0, \ldots, N$ be the state where $k$ 
compartments are occupied. At the next event, the next ball can go into one of 
the occupied compartments with probability $k/N$ or an empty compartment. 

(a) Create a Markov chain for this problem. 

(b) What is the state diagram? 

(c) What is the transition matrix, $A$? 

9.4 Suppose we have a discrete state-space and discrete-time measurement system 
(sampled-data). Defining the discrete (finite) states as $x_i(t)$ and the 
measurements as $y(t)$, then: 

(a) What is the state prediction probability, $\Pr(x_i(t) \mid Y_{t-1})$? 

(b) What is the state posterior distribution, $\Pr(x_i(t) \mid Y_t)$? 

(c) Sketch out the steps of the Bayesian filtering operation at times: $t - 1$ and $t$. 

9.5 Autoregressive ($AR(N_a)$) models (all-pole) occupy a large part of the signal 
processing literature especially in speech applications [10]. 

(a) An AR model is an example of a Markov chain on a continuous space. Show 
that the $AR(1)$ model forms a Markov chain, that is, 

$y(t) = a_1 y(t - 1) + e(t), \quad e \sim \mathcal{N}(0, \sigma^2)$ 

(b) Show an $AR(N_a)$ model also forms a Markov chain. (Hint: Place the AR 
model in state-space form). 

(c) Does an $ARMA(N_a, N_a)$ model form a Markov chain, as well? 

9.6 Autoregressive ($AR(N_a)$) switching models also occupy a significant part of the 
time series literature [14]. It is a model where the mean can switch between 
two values, $\mu_0$ and $\mu_1$. It enables the time series with several regimes or local 
nonstationarities to be represented as: 

$\Pr(y(t) \mid y(t-1), x(t), x(t-1)) \sim \mathcal{N}\left(\mu_{x(t)} + a_1\left(y(t-1) - \mu_{x(t-1)}\right), \sigma^2\right)$ 

$\Pr(x(t) \mid x(t-1)) \sim p_{x(t-1)} \mathbb{1}_{x(t-1)}(x(t)) + \left(1 - p_{x(t-1)}\right) \mathbb{1}_{1-x(t-1)}(x(t))$ 

where the hidden state, $x(t)$, takes values in $\{0, 1\}$ with initial values, $x(0) = \mu_0$ 
and $y(0) = 0$, and $\mathbb{1}$ is an indicator function. 

(a) What is the sampling distribution, $\Pr(x(t) \mid x(t-1), x(t+1), y(t), y(t-1), y(t+1))$? 

(b) Using the priors, $\Pr(\mu_1 - \mu_0) \sim \mathcal{N}(0, \gamma^2)$, $\Pr(a_1, \sigma^2)$ with $p_0, p_1 \sim \mathcal{U}(0, 1)$, 
simulate the AR switching model for $N = 500$ samples. What are the final 
parameter estimates for: $[\mu_0, \mu_1, a_1, p_0, p_1]$? 

9.7 Consider a 4-state Markov switching model [34] with the set of discrete states 
given by $S = \{1, 2, 3, 4\}$ representing a Markov chain with transition matrix: 

$\begin{bmatrix} 1 - 3\alpha & \alpha & \alpha & \alpha \\ \beta & 1 - \beta - 2\gamma & \gamma & \gamma \\ \beta & \gamma & 1 - \beta - 2\gamma & \gamma \\ \beta & \gamma & \gamma & 1 - \beta - 2\gamma \end{bmatrix}$ 

The observation sequence is specified by a set of AR(2) models: 

$y(t) = a_{1i} y(t - 1) + a_{2i} y(t - 2) + e_i(t)$ for $S_i$; $i = 1, \ldots, 4$ 

where $e_i \sim \mathcal{N}(0, \sigma^2)$. 

(a) What is the (state) prediction probability for this model, $\Pr(s(t) \mid Y_{t-1})$? 

(b) What is the (state) filtering posterior probability for this model, $\Pr(s(t) \mid Y_t)$? 

(c) What is the likelihood for the following set of parameters describing this 
model, $\Theta = \{\alpha, \beta, \gamma, \sigma^2, a_{1i}, a_{2i}\}$; $i = 1, \ldots, 4$? 

(d) Simulate the system with the following parameters $\{\alpha, \beta, \gamma, \sigma^2\} = \{0.0033, 
0.016, 0.002, 0.1\}$ and $\{a_{1i}, a_{2i}\} = \{(1.785, -0.903), (1.344, -0.903), 
(1.386, -0.640), (0.800, -0.640)\}$ for $N = 1000$ samples. 

9.8 Suppose we have an HMM with discrete state-space defined by 
$\mathcal{X} = \{x_i(t) = x_i\}$; $i = 1, \ldots, N_x$ with state transition matrix $A$. We would like 
to establish a parameter estimation problem given the observation sequence, 
$Y_T$. Assume that the complete likelihood is given by $\Pr(y, x \mid \theta)$ and the density 
is given by: $\Pr(y \mid x_i, \theta) = \mathcal{N}(\mu_{x_i}, \Sigma)$ with known diagonal covariance and 
unknown mean. Develop the EM algorithm to estimate the parameter $\mu$, that is, 

(a) What is the Q-step? 

(b) What is the M-step? 

9.9 State-space models are perhaps the most important class of linear dynamic 
systems characterized by: 

$x(t + 1) = Ax(t) + w(t)$ 
$y(t) = Cx(t) + v(t)$ 

Develop the EM algorithm for this class of model based on the usual Gauss-Markov 
assumptions. 

(a) Suppose we are asked to estimate the $A$ and $C$ parameters. What is 
the corresponding E-step? (Hint: Use gradient techniques.) 

(b) What is the associated M-step? 





10 

BAYESIAN PROCESSORS 
FOR PHYSICS-BASED 
APPLICATIONS 


In this chapter we develop a set of Bayesian signal processing applications based on the 
underlying physical phenomenology generating the measured process. The complexity 
of the process and the desire for enhanced signals drive the design, which is primarily indicated 
by the process model. We motivate each application, succinctly develop the process and 
measurement models, develop the BSP and analyze its performance. More details on 
any of the designs can be obtained from the references for the interested reader. The 
main objective is to demonstrate the applicability of the Bayesian approach to a variety 
of interesting applications and analyze their performance. 


10.1 OPTIMAL POSITION ESTIMATION FOR 
THE AUTOMATIC ALIGNMENT 

The alignment of high energy laser beams for fusion experiments demands high precision 
and accuracy of the underlying positioning algorithms, whether it be for actuator 
control or monitoring the beam line for potential anomalies. This section discusses 
the development of on-line, optimal position estimators in the form of Bayesian processors 
to achieve the desired results. Here we discuss the modeling, development, 
implementation and processing of Bayesian processors applied to both simulated and 
measured beam line data. 

10.1.1 Background 

Alignment of a high power beam is a complex and critical process requiring precise 
and accurate measurements. Misalignment of such a beam could easily destroy costly 
optics, causing a deleterious disruption of an entire experiment. The alignment of large 
operative, short pulse, laser systems is a significant and costly endeavor dating back to 
the late sixties and early seventies [1]. Contemporary imaging systems employ high-resolution 
video cameras to image and accurate position control systems to align the 
beam. This approach estimates the current beam position from the image, adjusts 
mirrors relative to an accurate reference measurement of physical beam center and 
minimizes their difference or deviation [2-9]. However, even with these sophisticated 
measurements, the beam position estimate is still a function of the inherent beam noise 
caused by the internal beam line gas turbulence as well as instrumentation noise [10]. 
Here we introduce the idea of post-processing these uncertain measurements using 
advanced signal processing techniques to “enhance” the raw data and “detect” any 
beam line anomalies during laser system operations [11]. 

High power, tightly focused laser beams are required to achieve successful ignition 
and therefore fusion at the Lawrence Livermore National Laboratory (LLNL) 
National Ignition Facility (NIF) [12]. These beams must simultaneously focus precisely 
on a nanoscale target capsule to succeed. Therefore, there are a large number 
of alignment measurements that must be performed along the NIF beam line to assure 
that the pointing and alignment control system centers the beam in order to provide the 
maximum energy on the fusion target located in the associated chamber [12-14]. An 
automatic alignment (AA) system was designed and implemented to assure successful 
deployment of the high energy beam in each of the 192 beam lines. However, since 
a variety of techniques are provided to perform the alignment, there is a quantifiable 
uncertainty associated with each technique that may or may not meet the desired accuracy 
and precision specifications associated with each control point. Therefore, there is 
a need for a post-processing technique, which accepts as input an uncertain position 
measurement and provides as output an improved position estimate (see Fig. 10.1). 
As illustrated in the figure, the measured image position estimate is compared to the 
reference image estimate, providing a position error that is input to the mirror control 
system that moves the associated mirrors, correcting the beam position. 

Perhaps the most challenging of all beam line measurements are those made on the 
KDP (potassium dihydrogen phosphate) crystals. These crystals are critical elements 
used to double or triple the frequency of the laser beam as it passes, providing a shorter 
wavelength. The higher frequency determined from the target physics enables laser 
plasma interactions for fusion. The NIF final optics assembly matches lenses to this 
particular frequency, producing the beam that is tightly focused on the target capsule 
required to achieve fusion ignition. In order for the KDP crystals to optimally double 
or triple the laser operating frequency, they must be precisely positioned at the appropriate 
angle. This is one of the critical tasks of the AA system. The KDP measurement 
consists of using a charge coupled device (CCD) imaging camera, which produces a 
noisy back reflection image of a diagnostic alignment beam. The noise is caused by the 
camera itself as well as uncertainties that are due to small pointing errors made during 
the measurement. A sophisticated two-dimensional (2D) phase-only, matched-filter 
[15] algorithm was developed to provide the initial raw position estimates. 

FIGURE 10.1 Optimal position estimation with anomaly detector: Bayesian post-processing 
and innovations-based anomaly detection. 

The image acquired from the CCD imaging camera produces both noisy measurement 
and reference images. A precise reference image is used to provide the 
desired fiducial that is used by the alignment system. Corrections to align the measured 
image with the reference are accomplished using the dedicated control loops that 
adjust pointing mirror stepping motors until the deviations between both reference 
and measured positions are within acceptable limits [14]. Ultimately, the goal is to 
make this difference zero, assuring proper beam alignment. Fig. 10.2 shows a sequence 
of measured KDP images with position estimates (O) along with the corresponding 
reference image measurement (+). Note how the control loop adjusts the KDP measurements 
(O) until the centroid position converges to the reference position (+). The 
objective, therefore, of the control loop is to adjust the beam mirrors such that both 
measured and reference positions completely overlap (zero deviation) as shown in the 
bottom of this figure. Thus, the smaller the XY-deviations, the closer the beam is to 
the centerline reference, assuring a tightly focused, high energy beam on target, which is the 
goal of the alignment system. Should, for some reason, an anomaly develop in any 
beam line, it may not be possible to align for a particular shot. The timely detection 
of beam line anomalies is necessary to avoid future problems, which, if left unmitigated, 
could result in less than optimum performance of the laser. In this work, we 
show how a Bayesian processor (BP) can be used to detect anomalies, on-line, during 
the calibration phase of a laser shot. 

FIGURE 10.2 Raw KDP crystal back-reflection image and reference position estimates. 

Thus, we discuss the feasibility of applying a Bayesian processor as an on-line post-processing 
technique to improve the final position estimates provided by a variety of 
estimators and detect anomalies in a high energy laser beam line [11]. These estimators 
are used to align high-powered laser beams for experiments at NIF. We first motivate 
a stochastic model of the overall process and measurement system. Next, we discuss 
the underlying theory for both position estimation and anomaly detection. With the 
theory in hand, we develop the processor for the KDP application through ensemble 
statistics, simulation and application to real measurement and reference data. 

10.1.2 Stochastic Modeling of Position Measurements 

The typical beam position estimate is obtained by calculating the centroid of 
the measured images [18]. The point is that these measured positions are derived 
from the associated noisy CCD images. We define the true centroid positions by the 
position vector $p(k) := [x(k) \mid y(k)]'$, where $p \in \mathbb{R}^{N_p \times 1}$ and $k$ is the sample time. Since 
quite a number of images are acquired daily, a large data base consisting of position 
estimates is available, say, $P(t) = \{p(k)\}$; $k = 1, \ldots, K$, and the position estimates are 
to be updated continuously. Since we know that the images are contaminated with 
noise and uncertainty, a more reasonable position measurement model is given by 

$z(k) = Cp(k) + v(k)$ (10.1) 

where $z, v \in \mathbb{R}^{N_z \times 1}$, $C \in \mathbb{R}^{N_z \times N_p}$ and $v \sim \mathcal{N}(0, R_{vv})$, that is, the measurement noise is 
assumed zero-mean, multivariate Gaussian with covariance matrix $R_{vv} \in \mathbb{R}^{N_z \times N_z}$. 

We also note that since the CCD camera uses the identical beam line to measure 
both images, we model the position estimates as piecewise constants contaminated 
with beam line noise in addition to the noise contributed by the measurement systems. This 
process noise can be considered fluctuations caused by the inherent system optical 
transfer functions and turbulence caused by the argon gas filled housing during the 
laser beam propagation (boiling noise). Therefore, we assume that the contaminated 
position measurement is represented as 

$\dot{p}(t) = 0 + w(t)$ with $p(0) = [x(0) \mid y(0)]'$ (10.2) 

in continuous time, but if we discretize over the $k$-sample images, then using first 
differences we obtain 

$\dot{p}(t) \approx \dfrac{p(t_{k+1}) - p(t_k)}{\Delta t_k} = w(t_k)$ for $\Delta t_k := t_{k+1} - t_k$ (10.3) 

Substituting into Eq. 10.2, we obtain the following Gauss-Markov position model 

$p(t_{k+1}) = p(t_k) + \Delta t_k w(t_k)$ (10.4) 

where $p, w \in \mathbb{R}^{N_p \times 1}$ and $w \sim \mathcal{N}(0, R_{ww})$ along with the accompanying measurement 
model of Eq. 10.1. This completes the basic description of the underlying Gauss-Markov 
model. We note in passing that a complete description of the general 
model, including the means $m_p(k)$, $m_z(k)$ with their associated covariances, is given in [11]. 

Since we have both final measured and reference position estimates, we re-define 
a more convenient position vector by grouping the positions as 

$p(k) := \begin{bmatrix} x(k) \\ y(k) \\ x_r(k) \\ y_r(k) \end{bmatrix}$ and $w(k) := \begin{bmatrix} w_x(k) \\ w_y(k) \\ w_{x_r}(k) \\ w_{y_r}(k) \end{bmatrix}$ 

where $p(0) \sim \mathcal{N}(\bar{p}(0), R_{pp}(0))$ are the respective measurement and reference position 
coordinates (pixels), with the corresponding process noise covariance matrix given by 

$R_{ww} = \begin{bmatrix} R_{ww} & 0 \\ 0 & R_{w_r w_r} \end{bmatrix}$ 

since the measured and reference images are uncorrelated. 

This type of formulation enables us to estimate the process noise covariances 
independently for each image as well as characterize their uncertainties individually. 
Since the control loop jointly uses both the reference and measured images to produce 
its final centroid position estimates, we model the measurement matrix as the deviation 
(difference) between these data contaminated with independent measurement noise, 
that is, our measurement model of Eq. 10.1 becomes 

$z(k) = Cp(k) + v(k) = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} p(k) + v(k)$ 

with measurement covariance matrix $R_{vv}$. Note that the deviations are defined by 

$\Delta p(k) := Cp(k) = \begin{bmatrix} \Delta x(k) \\ \Delta y(k) \end{bmatrix} = \begin{bmatrix} x(k) - x_r(k) \\ y(k) - y_r(k) \end{bmatrix}$ 

This completes the section on position (uncertainty) modeling. Next we consider the 
development of the optimal position estimator. 
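The Gauss-Markov position model of Eq. 10.4 together with the deviation measurement can be simulated in a few lines. The sketch below is illustrative only and rests on several assumptions: the deviation matrix `C` follows from the deviation definition above with the grouped state ordered as measured x, measured y, reference x, reference y; the initial positions and measurement noise variances are the values quoted later in the simulation section; and the process noise standard deviation `q_std`, the unit time step and the seed are hypothetical.

```python
import random

# Deviation matrix implied by the definition of the deviations:
# rows give x - x_r and y - y_r for the state [x, y, x_r, y_r].
C = [[1.0, 0.0, -1.0, 0.0],
     [0.0, 1.0, 0.0, -1.0]]


def deviation(p):
    """Noise-free deviation C p for a grouped position vector p."""
    return [sum(c * pi for c, pi in zip(row, p)) for row in C]


def simulate(p0, n, dt, q_std, r_std, rng):
    """Random-walk positions with noisy deviation measurements."""
    p = list(p0)
    positions, measurements = [], []
    for _ in range(n):
        positions.append(list(p))
        dev = deviation(p)
        measurements.append([d + rng.gauss(0.0, s) for d, s in zip(dev, r_std)])
        # Eq. 10.4: p(k+1) = p(k) + dt * w(k), independent noise per component.
        p = [pi + dt * rng.gauss(0.0, q_std) for pi in p]
    return positions, measurements


rng = random.Random(1)
p0 = [344.0, 270.0, 344.0, 270.0]          # mean positions (pixels)
r_std = [0.0076 ** 0.5, 0.0078 ** 0.5]     # from the measurement noise variances
positions, z = simulate(p0, 48, 1.0, 1e-3, r_std, rng)  # q_std is hypothetical
```

With identical measured and reference means, the noise-free deviation is zero, so the simulated measurements scatter about zero as expected for an aligned beam.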

10.1.3 Bayesian Position Estimation and Detection 

It is well-known that the optimal solution to the position estimation problem under the 
Gauss-Markov model assumption is provided by the Bayesian (state-space) processor 
or Kalman filter of Chapter 5. This solution can be considered a predictor-update 
design in which the processor uses the model (random walk) to predict in the absence 
of a measurement and then updates the estimate when the measurement becomes 
available. Under the assumption of a perfect model, that is, the model embedded in 
the processor exactly matches the process, the BP is capable of achieving the optimal 
minimum (error) variance estimate under the Gaussian assumptions. 

For the position estimation problem, the BP takes on the following predictor-update 
form where $t_k \to k$: 

Prediction: $\hat{p}(k+1|k) = \hat{p}(k|k)$ 

Innovation: $e(k+1) = z(k+1) - \hat{z}(k+1|k) = z(k+1) - C\hat{p}(k+1|k)$ 

Update: $\hat{p}(k+1|k+1) = \hat{p}(k+1|k) + G(k+1)e(k+1)$ 

Gain: $G(k+1) = \tilde{P}(k+1|k) C^T R_{ee}^{-1}(k+1)$ 

for $\hat{p}(k+1|k) := E\{p(k+1) \mid z(k), \ldots, z(1)\}$ and $\hat{z}(k+1|k)$, the underlying conditional 
means, and $G(k+1) \in \mathbb{R}^{N_p \times N_z}$ is the corresponding gain matrix calculated 
from the error covariance matrix, $\tilde{P}(k+1|k) = \mathrm{Cov}(\tilde{p}(k+1|k))$, with position error 
$\tilde{p}(k+1|k) := p(k+1) - \hat{p}(k+1|k)$. The innovations covariance matrix is denoted by 
$R_{ee}(k+1)$. 
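The predictor-update recursion can be illustrated on a single deviation channel. This is a simplified sketch, not the processor used in the application: it treats one scalar deviation as a random walk observed directly (C = 1), absorbs the time step into the process noise variance `q`, and uses the standard Kalman filter variance recursions, which the text above does not list explicitly. The function name and the test values are hypothetical.

```python
def random_walk_bp(z, q, r, m0, p0):
    """Scalar random-walk Bayesian processor (Kalman filter) with C = 1."""
    m, P = m0, p0
    estimates, innovations, innovation_vars = [], [], []
    for zk in z:
        # Prediction: the random walk leaves the estimate unchanged
        # and inflates the error variance by the process noise variance.
        P = P + q
        # Innovation and its variance.
        e = zk - m
        Ree = P + r
        # Gain and update.
        G = P / Ree
        m = m + G * e
        P = (1.0 - G) * P
        estimates.append(m)
        innovations.append(e)
        innovation_vars.append(Ree)
    return estimates, innovations, innovation_vars


# Constant (noise-free) measurement of 1.0 with a poor initial estimate of 0.0.
est, innov, ree = random_walk_bp([1.0] * 50, q=1e-4, r=0.0076, m0=0.0, p0=1.0)
```

Increasing `q` raises the steady-state gain, so the estimate follows the measurements more closely at the cost of passing more noise; this is the tuning trade-off when selecting the process noise covariance.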

Recall that for a Gauss-Markov representation a necessary and sufficient condition 
for the BP to be optimal is that the innovations sequence be zero-mean and 
white (uncorrelated), conditions that we test during processor design. It is possible 
to exploit this property of the innovations to detect anomalies in the system, revealing 
potential problems in the beam line in pseudo real-time, which is a large asset when 
attempting to automate the alignment system. We apply two detection techniques to 
monitor the position estimates from the daily measurements: the zero-mean/whiteness 
detector and the weighted-sum-squared-residual (WSSR) detector (see Sec. 5.7 for 
details). Both of these techniques rely on the assumption that the position deviations 
in the loop should change little during the measurement cycles. The underlying idea 
in anomaly detection is that the processors are “tuned” during calibration to operate 
in an optimal manner (innovations zero-mean/white). However, when an anomaly 
occurs, the innovations no longer maintain their optimal statistical properties, leading 
to a “change from normal” and an anomaly is declared. With this detection, a decision 
must be made to either classify the anomaly or perform some other action. The main 
point to realize is that since we are using a recursive-in-time BP, the innovations 
represent “how well the position model and its underlying statistics represent the raw 
data.” If it is an optimal fit, then the innovations should be zero-mean and statistically 
white (uncorrelated) and the detectors evolve naturally. 

This completes the description of the BP and the associated anomaly detectors. 
Next we consider a simulation of the data and processor to predict the expected 
performance. 
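A zero-mean/whiteness check of a scalar innovations sequence can be sketched as follows. The specific bounds are assumptions, since the detectors themselves are detailed in Sec. 5.7 rather than here: the sample mean is compared with a 95% bound of 1.96 times the standard error, and the sequence is called white when no more than 5% of the normalized sample covariance lags fall outside ±1.96/√N. The function names and the default number of lags are hypothetical.

```python
import math


def zero_mean_test(e):
    """Return (passes, sample mean, 95% bound) for a scalar sequence."""
    n = len(e)
    mean = sum(e) / n
    var = sum((x - mean) ** 2 for x in e) / n
    bound = 1.96 * math.sqrt(var / n)
    return abs(mean) < bound, mean, bound


def whiteness_test(e, max_lag=None):
    """Return (passes, percent of lags outside the 95% bound)."""
    n = len(e)
    max_lag = max_lag or n // 4
    mean = sum(e) / n
    c0 = sum((x - mean) ** 2 for x in e) / n
    if c0 == 0.0:
        raise ValueError("constant sequence: covariance is undefined")
    bound = 1.96 / math.sqrt(n)
    out = 0
    for lag in range(1, max_lag + 1):
        ck = sum((e[t] - mean) * (e[t + lag] - mean) for t in range(n - lag)) / n
        if abs(ck / c0) > bound:
            out += 1
    pct_out = 100.0 * out / max_lag
    return pct_out <= 5.0, pct_out


# A strongly correlated (alternating) sequence: zero-mean but far from white.
alternating = [1.0, -1.0] * 50
is_zero_mean, mean, bound = zero_mean_test(alternating)
is_white, pct_out = whiteness_test(alternating)
```

An anomaly would show up as a failure of either test on innovations that previously passed both, which is the “change from normal” described above.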

10.1.4 Application: Beam Line Data 

In this section, we discuss the application of optimal position estimators based on 
daily historical position estimates (data base) provided by the KDP algorithm. The 
data are a record of 48 days of final positions output from both the accurate reference 
system (reference data) and the estimated final position output from the KDP back-reflection 
data coupled to a phase-only, matched filter [15] imaging algorithm. The 
motivation for applying the Bayesian approach to this data set is to obtain a more 
accurate and precise estimate of error deviations from the reference characterizing 
the overall control loop performance in that position of the beam line. Ideally, 
the position deviations between the reference and the KDP estimates should be 
zero, but because of the beam line noise and variations, CCD camera limitations and 
control system tolerances, this is not the case. Instead, the overall error statistics are 
used to bound the performance and assure that they remain within design specification. 

10.1.4.1 BP Design In the design of the BP the usual procedure is to: (1) develop 
the required models; (2) simulate a set of position data characterized by any of the 
a priori information available; (3) develop the minimum error variance design; and (4) 
apply the processor to the available data set, evaluating its performance. We developed 
the basic model set assuming a Gauss-Markov structure as in the previous subsection. 
The parameters of this model were estimated from the statistics of a large ensemble 
(>5000) of images and performance statistics of the KDP centroiding algorithm. We 
used this information to construct the initial simulation of the BP to “match” with the 
historical data available (48 days) and the corresponding ensemble statistics. 

Thus, our approach is to first perform a simulation of the measurement process 
using estimated statistics from the data and then apply the processor to the actual data. During 
the simulation phase, we are able to analyze the performance of the BP and assure 
ourselves that all of the models and statistics are correct. We expect to obtain the 
minimum variance estimates; if they are not achieved, it is usually an implementation 
issue. Once the simulation and model adjustments have been made, we apply the 
processor to the measured data. 

10.1.4.2 Simulation Now that we have developed the models and have some 
estimates from the ensemble statistics of the database, we are able to perform 
a Gauss-Markov simulation to assess the feasibility of the BP. We simulated a set 
of data based on the following parameters using the mean XY-position estimates, 
that is, $x_i = \mu_{x_i} \pm 1.96\sigma_{x_i}$, for both the measured and reference data: $x(0) = [344 \pm 8, 
270 \pm 13.8, 344 \pm 7.8, 270 \pm 14.4]'$. We used the mean values as the initial position 
estimates with low error variances ($1 \times 10^{-6}$). We chose an uncorrelated measurement 
(deviation) noise covariance matrix as $R_{vv} = \mathrm{diag}[0.0076 \ \ 0.0078]$. The process 
noise covariance, $R_{ww}$, was also selected for the simulation. 

The synthesized positions have small variations due to the process noise, but are 
essentially constants (mean values). The simulated noisy $\Delta X$, $\Delta Y$-deviation measurements 
(dotted line) including the process and measurement noise are shown in 
Fig. 10.3. Next we executed the BP over this data set and the position deviations (solid 
line) are also shown in the figure. Here we see that much of the raw measurement and 
process noise have been removed by the processor and only the deviation errors are 
shown. Recall that they should be close to zero for the alignment control system to 
be operating efficiently. 

FIGURE 10.3 Simulated XY-position deviations: Raw (dotted) and estimated (solid). 
(a) X-deviation. (b) Y-deviation. 

To validate the results, we follow the minimum variance design procedure: we
investigate the statistics of the innovations sequences and check that they are
zero-mean and white for optimality. Both innovations sequences are zero-mean
(0.30 < 3.5; 0.95 < 3.3) and white (4% out; 0% out), with the WSSR decision function
lying below the threshold for a window of size N (Threshold = 69.6, N = 25). These
statistics indicate that the minimum variance position deviation estimates have been
achieved. This result is expected, since the models used in the BP are identical to
those in the Gauss-Markov simulator. However, the results can be interpreted as the
“best” (minimum variance) one could hope to do under these circumstances. Next we
apply the BP to the actual deviation data.
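
The zero-mean and whiteness checks quoted above can be sketched as follows. This is a
minimal illustration, not the toolbox routine used for the reported numbers: it assumes
the usual 95% confidence bounds (sample mean compared with $1.96\sqrt{\hat{R}_{ee}(0)/N}$,
normalized autocorrelation lags compared with $\pm 1.96/\sqrt{N}$), and the function
names `zero_mean_test` and `whiteness_test` are hypothetical.

```python
import math


def zero_mean_test(innov):
    """Return (|sample mean|, bound); the test passes when the mean is below the bound.

    Assumed bound: 1.96 * sqrt(sample variance / N), a 95% confidence interval.
    """
    n = len(innov)
    mean = sum(innov) / n
    var = sum((e - mean) ** 2 for e in innov) / n
    bound = 1.96 * math.sqrt(var / n)
    return abs(mean), bound


def whiteness_test(innov, max_lag=None):
    """Return the percentage of autocorrelation lags outside +/- 1.96/sqrt(N).

    Roughly 5% or less outside the bound is taken as consistent with whiteness.
    """
    n = len(innov)
    max_lag = max_lag or n // 4
    mean = sum(innov) / n
    c0 = sum((e - mean) ** 2 for e in innov) / n
    bound = 1.96 / math.sqrt(n)
    out = 0
    for k in range(1, max_lag + 1):
        # Biased sample autocovariance at lag k
        ck = sum((innov[t] - mean) * (innov[t + k] - mean)
                 for t in range(n - k)) / n
        if abs(ck / c0) > bound:
            out += 1
    return 100.0 * out / max_lag
```

Each innovation channel is tested separately, which is why the text reports one pair
of numbers per channel, for example (0.30 < 3.5) for the mean test and (4% out) for
the whiteness test.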

10.1.5 Results: Beam Line (KDP Deviation) Data 

In order to process the KDP deviation data, we must develop parameters for the
underlying model embedded in the processor. We start with guesses of the noise (process
and measurement) statistics from the simulator and then adjust them accordingly.
For instance, if the innovation sequence lies outside its predicted bounds, implying
that the measurement noise variance is too small, then the variance can be adjusted to
satisfy this constraint and “match” the data. The process noise is actually quite
difficult to select, since it is directly proportional to the BP gain. In essence, once
all of the other model parameters are reasonably adjusted, they are then held fixed and
the process noise covariance is varied to achieve the “best” possible (minimum error
variance) innovations statistics (zero-mean/white, WSSR below threshold).

FIGURE 10.4 Actual KDP data XY-deviations and estimated XY-positions: Raw measured
data (dashed). Estimated data (solid): (a) ΔX-position. (b) ΔY-position.

The results of the BP design for the KDP deviation data are shown in Fig. 10.4 along
with the corresponding error ellipsoids in Fig. 10.5. We see the raw KDP position
deviation estimates along with the optimal processor results over the 48-day period in
Fig. 10.4. It is clear that the processor is tracking the trends in the data while
reducing the noise or, equivalently, enhancing the SNR. To confirm this we observe the
three-sigma error ellipsoid plots of Fig. 10.5, where much of the uncertainty has been
removed and the estimated deviations are clearly clustered (centered) around the (0,0)
position and have a much smaller ellipsoid (better precision) than the raw data. Again,
to confirm the optimality of the processor we check the zero-mean/whiteness/WSSR
statistics, which give the following results: zero-mean (0.08 < 0.80; 0.05 < 0.71);
approximately white (4% out; 8% out); and the WSSR decision function lies below the
threshold for a window of size N (Threshold = 69.6, N = 25). The results are shown in
Fig. 10.6. These statistics again indicate that the minimum variance design has been
achieved and that the processor along with its associated statistics is valid and
optimal. This completes the BP design for the position estimation problem. Next we
investigate the feasibility of using this approach for anomaly detection.

FIGURE 10.5 Three-sigma error ellipsoid for KDP XY-position deviations from simulated
data: Raw deviations (O) and estimated (+) with fewer outliers.

10.1.6 Results: Anomaly Detection 

We use the actual KDP measurement data as before to assess the feasibility of applying
the BP as an effective means to detect anomalies (off-normal conditions) that could
occur in the beam line. Here we assume that the BP has been “tuned” to normal
operations for the particular beam line and optics. Thus, the innovations are
zero-mean and white, and the WSSR lies below the threshold for optimal design. To
“simulate” an off-normal condition (in terms of position estimates), we increase the
amplitude of both X and Y deviations fivefold in the raw measurement data during the
period of 29-35 days. This can be thought of as corresponding to a pulse-like transient
in either or both of the measurement and reference position data. We applied the
optimal (for no anomaly) BP to this data to determine whether or not it could detect
and track the unknown transient anomaly.

FIGURE 10.6 BP design for KDP XY-position deviation data: (a) Innovations for
X-deviation and Y-deviation. (b) Optimality tests: Zero-mean (0.08 < 0.80;
0.05 < 0.71) and whiteness (4% out; 8% out). (c) WSSR test (Threshold = 69.6, N = 25).

FIGURE 10.7 Raw KDP data XY-deviations with simulated anomaly: Raw (dotted) and
estimated (solid).

The results of the effect on the deviation measurements are shown in Fig. 10.7.
We see the normal measurement and then the transient “jump” within the prescribed
time period. It is interesting to note that there is a corresponding disturbance in the
estimated (filtered) measurement, indicating that there is enough of a disruption at
this SNR to cause the BP to track it. However, the BP is not able to track the
transient extremely well, which is the expected result. Next we observe the deviation
innovation sequences output by the BP in Fig. 10.8. We again observe the transient
anomaly in both sequences, since the position estimates do not track it well enough.
The zero-mean/whiteness tests seem a bit too insensitive to the rapid change
(6 samples) and do not dramatically detect it; even the whiteness tests do not indicate
a non-white sequence (0% out; 4% out). On the other hand, the WSSR test clearly detects
the “change from normal” caused by the simulated beam line anomalies almost
instantaneously, indicating a feasible solution. This is again expected, since the WSSR
statistic $\rho(\ell)$ can be tuned to transient disruption by selecting the
appropriate window length. For this problem we chose (N = 5; Threshold = 18). To verify
the performance of the zero-mean/whiteness test we observe the estimated positions
(states) in Fig. 10.9. Here we see that the BP actually responds to such a high
amplitude level change in the anomaly transient. The processor could easily be “tuned”
to ignore the change of the states by decreasing the process noise variance. Finally,
the scatter plots for this case are shown in Fig. 10.10: one for the optimal solution
(no anomaly) and one for the solution with the anomaly. The size difference of the
error ellipsoids in both cases is obvious due to the added uncertainty of the modeled
anomaly. The three-sigma error ellipsoid for the case with an anomaly indicates that
something is different, since the estimated deviation uncertainty (ellipsoid) is
actually about the same size or larger than that of the raw data. This completes the
study of applying the Bayesian anomaly detector.

FIGURE 10.8 BP detection with simulated anomaly: (a) DX and DY innovations with
anomaly. (b) Optimality tests: Zero-mean (0.07 < 1.7; 0.05 < 2.3); whiteness (0% out;
4% out). (c) WSSR test (Threshold = 18; N = 5).

FIGURE 10.9 Estimated XY-positions with anomaly: (a) Measured position estimates.
(b) Reference position estimates.
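
A sliding-window WSSR detector of the kind used above can be sketched as follows. The
sketch makes two assumptions that are not spelled out in this section: the innovations
covariance is diagonal, and the threshold is the large-window Gaussian approximation
$\tau = Np + 1.96\sqrt{2Np}$, with $N$ the window length and $p$ the number of
measurement channels. Under that assumption the quoted value of 69.6 for N = 25 and two
channels is reproduced exactly, while N = 5 gives roughly 18.8, close to the quoted
threshold of 18. The names `wssr_threshold` and `wssr` are hypothetical.

```python
import math


def wssr_threshold(window, n_meas, z=1.96):
    """Assumed threshold tau = N*p + z*sqrt(2*N*p) for window N and p channels."""
    dof = window * n_meas
    return dof + z * math.sqrt(2.0 * dof)


def wssr(innovations, variances, window):
    """Sliding-window weighted sum-squared residual rho(l).

    innovations: list of length-p lists e(k)
    variances:   length-p list of innovation variances (diagonal R_ee assumed)
    Returns rho(l) for l = window-1, ..., len(innovations)-1.
    """
    weighted = [sum(e_i * e_i / r_i for e_i, r_i in zip(e, variances))
                for e in innovations]
    return [sum(weighted[l - window + 1:l + 1])
            for l in range(window - 1, len(weighted))]


def detect(innovations, variances, window):
    """Flag each window position where rho(l) exceeds the threshold."""
    tau = wssr_threshold(window, len(variances))
    return [rho > tau for rho in wssr(innovations, variances, window)]
```

A short window reacts quickly to a transient of only a few samples, which is the
motivation for dropping from N = 25 in the design stage to N = 5 for detection.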


10.2 BROADBAND OCEAN ACOUSTIC PROCESSING 

Acoustic sources found in the ocean environment are spatially complex and broadband,
complicating the analysis of received acoustic data considerably. A Bayesian
approach is developed for a broadband source in a shallow ocean environment
characterized by a normal-mode propagation model. Here we develop the “optimal”
Bayesian solution to the broadband pressure-field enhancement and modal function
extraction problem.

10.2.1 Background 

Acoustic sources found in the hostile ocean environment are complex both spatially 
and temporally being broadband rather than narrowband. When propagating in the 
shallow ocean these source characteristics complicate the analysis of received acoustic 
data considerably—especially in littoral regions providing an important challenge for 
signal processing [19-24]. It is this broadband or transient source problem that leads 
us to a Bayesian signal enhancement solution. 

FIGURE 10.10 Three-sigma error ellipsoid and estimated XY-position deviations:
(a) Measured data (o) no anomaly. (b) Measured data (o) with anomaly. Enhanced (+).

The uncertainty of the ocean medium motivates the use of stochastic models to capture
the random nature of the phenomena ranging from ambient noise and scattering to distant
shipping and the nonstationary nature of this hostile environment. When contemplating
the broadband problem it is quite natural to develop temporal techniques, especially if
the underlying model is the full wave equation; however, if we assume a normal-mode
propagation model then it seems more natural to: (1) filter the broadband receiver
outputs into narrow bands; (2) process each band with a devoted processor; and then
(3) combine the narrowband results either coherently [25] or incoherently [26, 27] to
create a broadband solution. One apparent advantage of this approach is that it
utilizes narrowband signal processing techniques, thereby providing some noise
rejection and intermediate enhancement for the next stage. This is the approach we take
in this section to construct a broadband Bayesian processor (BP), that is, we first
decompose the problem into narrow bands and construct a bank of Bayesian processors,
one for each band. Finally, we incoherently or coherently combine the outputs of each
to provide an overall enhanced signal useful for detection and localization.
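
Steps (1) and (3) of this scheme can be illustrated with a small sketch for a single
sensor time series. It is only an illustration under stated assumptions: the narrowband
filters are taken to be single-bin DFTs evaluated at each source frequency, the
incoherent combination is taken to be an average of narrowband powers, and the function
names are hypothetical. Step (2), the per-band processor, is the subject of the
sections that follow and is not shown here.

```python
import cmath
import math


def dft_bin(x, fs, freq):
    """Single-bin DFT of the real series x (sample rate fs in Hz) at freq (Hz).

    Scaled so that a unit-amplitude sinusoid lying exactly on the bin
    returns a magnitude close to 1.
    """
    n = len(x)
    acc = sum(x[k] * cmath.exp(-2j * math.pi * freq * k / fs) for k in range(n))
    return 2.0 * acc / n


def narrowband_bank(x, fs, freqs):
    """Step (1): split one sensor time series into narrowband outputs."""
    return {f: dft_bin(x, fs, f) for f in freqs}


def incoherent_combine(bank):
    """Step (3), incoherent variant: average the narrowband powers (phase discarded)."""
    return sum(abs(v) ** 2 for v in bank.values()) / len(bank)


# Usage: two tones at 50 Hz and 70 Hz, observed for one second at 1 kHz.
fs = 1000.0
x = [math.cos(2 * math.pi * 50 * k / fs) + 0.5 * math.cos(2 * math.pi * 70 * k / fs)
     for k in range(1000)]
bank = narrowband_bank(x, fs, [50, 60, 70])
```

A coherent combination would instead retain the complex narrowband outputs and sum them
with their phases intact.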

Here we discuss the enhancement of noisy broadband acoustic pressure-field
measurements from a vertical array of hydrophone sensors using a multichannel Bayesian
technique. The Bayesian approach is developed for the broadband source using a
normal-mode propagation model. Propagation theory predicts that a different modal
structure evolves for each spectral line; therefore, it is not surprising that the
multichannel Bayesian solution to this problem results in a scheme that requires a
“bank” of processors, each employing its own underlying modal structure for the narrow
frequency band that it operates over. The Bayesian solution using state-space forward
propagators is developed, and it is shown that each processor is decoupled in modal
space and recombined in the measurement space to provide enhanced estimates [28, 29].

The methodology employed is based on a state-space representation of the normal-mode
propagation model [28]. When state-space representations can be accomplished, then
many of the current ocean acoustic processing problems can be analyzed and solved using
this more revealing and intuitive framework, which is based on firm statistical and
system theoretic grounds. In this application, we seek techniques to incorporate the:
(1) ocean acoustic propagation model; (2) sensor array measurement model; and (3) noise
models (ambient, shipping, surface and measurement) into the processor to solve the
associated signal enhancement problem.


10.2.2 Broadband State-Space Ocean Acoustic Propagators 

In this section we discuss the development of a broadband propagator eventually 
employed in a Bayesian scheme to enhance noisy pressure-field measurements from 
a vertical array of hydrophone sensors. First, we briefly discuss the propagator from 
normal-mode theory following the Green’s function approach [32] and then extend 
it using a state-space representation to develop a forward propagation scheme for 
eventual use in Bayesian processor design. 

It is well-known [32-34] in ocean acoustics that the pressure-field solution to the
Helmholtz equation under the appropriate assumptions can be expressed as the sum
of normal modes

$$ p(r, z, t) = \sum_{m=1}^{M} a\, H_0(\kappa_r(m) r)\, \phi_m(z_s)\, \phi_m(z)\, e^{i\omega_0 t} \tag{10.5} $$

where $p$ is the acoustic pressure-field; $a$ is the source amplitude; $H_0$ is the
zeroth-order Hankel function; $\phi_m$ is the $m$th modal function evaluated at $z$ and
source depth $z_s$; $\kappa_r(m)$ is the horizontal wave number associated with the
$m$th mode; $\omega_0$ is the temporal source frequency, and $r$ is the horizontal
range. The wave numbers satisfy the corresponding dispersion relation

$$ k^2 = \frac{\omega_0^2}{c^2(z)} = \kappa_r^2(m) + \kappa_z^2(m), \qquad m = 1, \ldots, M \tag{10.6} $$

where $\kappa_z$ is the vertical wave number with $c$ the depth-dependent sound speed
profile.
Taking the temporal Fourier transform of the pressure-field, we obtain

$$ p(r, z, \omega) = \sum_{m=1}^{M} a\, H_0(\kappa_r(m) r)\, \phi_m(z_s, \omega)\, \phi_m(z, \omega)\, \delta(\omega - \omega_0) \tag{10.7} $$

indicating a narrowband solution or equivalently a line at $\omega_0$ in the temporal
frequency domain.

This modal representation has been extended to include a broadband source, $s(t)$,
with corresponding spectrum, $S_s(\omega)$. In this case, the ocean medium is specified
by its Green's function (impulse response), which can be expressed in terms of the
inherent normal modes spanning the water column

$$ \mathcal{G}(r, z, \omega) = \sum_{m} H_0(\kappa_r(m) r)\, \phi_m(z_s, \omega)\, \phi_m(z, \omega) \tag{10.8} $$

and therefore the resulting pressure-field in the temporal frequency domain is given by

$$ p(r, z, \omega) = \mathcal{G}(r, z, \omega)\, S(\omega) \tag{10.9} $$

which equivalently corresponds to a convolution of $\mathcal{G}(r, z, t)$ and $s(t)$ in
the time domain.

Suppose we decompose the continuous source spectrum into a sampled or discrete
spectrum using a periodic impulse (frequency) sampler, then it follows that

$$ S_s(\omega) = \Delta\omega \sum_{q} S(\omega)\, \delta(\omega - \omega_q) = \Delta\omega \sum_{q} S(\omega_q)\, \delta(\omega - \omega_q) \tag{10.10} $$

from the sampling properties of Fourier transforms. Therefore a broadband signal
spectrum can be decomposed into a set of narrowband components assuming an
impulse sampled spectrum.

Utilizing this property in Eq. 10.10 and extracting just the $q$th source frequency,
we have that

$$ p(r, z, \omega_q) = \mathcal{G}(r, z, \omega_q)\, S_s(\omega_q) \tag{10.11} $$

where $S_s(\omega_q)$ can be interpreted as a single narrowband impulse at $\omega_q$
with amplitude $a_q = \Delta\omega\,|S(\omega_q)|$. Suppose that the broadband source
is assumed to be bandlimited and sampled, $S_s(\omega)$,
$\omega_1 \le \omega \le \omega_Q$, then

$$ S_s(\omega) = \Delta\omega \sum_{q=1}^{Q} S(\omega_q)\, \delta(\omega - \omega_q) \tag{10.12} $$

Thus, the normal-mode solution to the Helmholtz equation for the broadband
source problem can be decomposed into a series of narrowband solutions, that is,

$$ p(r, z, \omega_q) = \sum_{m=1}^{M_q} a_q\, H_0(\kappa_r(m,q) r)\, \phi_m(z_s, \omega_q)\, \phi_m(z, \omega_q) \tag{10.13} $$

and the dispersion relation now satisfies

$$ \kappa_z^2(m,q) = \frac{\omega_q^2}{c^2(z)} - \kappa_r^2(m,q), \qquad m = 1, \ldots, M_q;\ q = 1, \ldots, Q \tag{10.14} $$

This overall decomposition of the field into narrowband components suggests an
enhancer that processes each narrowband component of the source separately; the
broadband result is then the superposition of the pressure-fields of Eq. 10.13 given by

$$ p(r, z) = \sum_{q=1}^{Q} p(r, z, \omega_q) \tag{10.15} $$

assuming an incoherent approach as depicted in Fig. 10.11.
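
As a small worked example of the dispersion relation in Eq. 10.14, the vertical wave
number of a mode follows directly from the source frequency, the local sound speed and
the horizontal wave number. The sketch below assumes a constant sound speed over the
depth of interest; the function name and the numerical values in the usage line are
illustrative only and are not taken from the SNAP results discussed later.

```python
import math


def vertical_wavenumber(freq_hz, sound_speed, kappa_r):
    """Return kappa_z from kappa_z^2 = (omega/c)^2 - kappa_r^2 (Eq. 10.14).

    Raises ValueError when kappa_r exceeds omega/c, in which case the
    right-hand side is negative and kappa_z is not real.
    """
    k = 2.0 * math.pi * freq_hz / sound_speed
    kz2 = k * k - kappa_r * kappa_r
    if kz2 < 0.0:
        raise ValueError("kappa_r exceeds omega/c: kappa_z is not real")
    return math.sqrt(kz2)


# Usage with illustrative values: 50 Hz, c = 1500 m/s, kappa_r = 0.2 rad/m
kz = vertical_wavenumber(50.0, 1500.0, 0.2)
```

Because $\omega_q$ enters the relation, every spectral line has its own set of vertical
wave numbers, which is why each narrowband processor carries its own modal structure.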

Suppose we further assume an $L$-element vertical sensor array, then $z \to z_\ell$,
$\ell = 1, \ldots, L$ and therefore, the pressure-field at the array for the $q$th
temporal frequency of Eq. 10.13 becomes

$$ p(r, z_\ell, \omega_q) = \sum_{m=1}^{M_q} \beta_m(r, z_s, \omega_q)\, \phi_m(z_\ell, \omega_q) \tag{10.16} $$

where $\beta_m(r, z_s, \omega_q) := a_q\, H_0(\kappa_r(m,q) r)\, \phi_m(z_s, \omega_q)$
is the $m$th modal coefficient at the $q$th temporal frequency.

Thus, following Eq. 10.15 the broadband pressure-field at the $\ell$th sensor is simply

$$ p(r, z_\ell) = \sum_{q=1}^{Q} p(r, z_\ell, \omega_q) \tag{10.17} $$

which corresponds to incoherently summing all of the narrowband solutions over the
discrete spectral lines (bins) [25].

FIGURE 10.11 Decomposition of broadband signal into narrowband components based
on normal-mode propagation representation.

Thus, the amount of spectral energy “seen” at the $\ell$th sensor in the $q$th
spectral bin ($\omega_q = q\Delta\omega$) is defined by $p(z_\ell, \omega_q)$ and
therefore the total energy seen by the array in the $q$th bin is given by the set of
sensor outputs, $P(\omega_q) = [p(z_1, \omega_q), \ldots, p(z_L, \omega_q)]$. This
implies that the subsequent processor must be able to process $Q$ temporal frequencies
over the array of $L$ sensors or must have $LQ$ processors. One obvious technique
would be to collect the array snapshots at each frequency and incoherently average
the results, that is,

$$ \bar{P}(\omega) = \frac{1}{Q} \sum_{q=1}^{Q} P(\omega_q) \tag{10.18} $$

Using this approach implies that the “temporal incoherent processor” replaces the
noisy broadband signal $P(\omega)$ with the smoothed signal $\bar{P}(\omega)$.

It is well known [29] that the state-space propagators for the narrowband
pressure-field can be obtained from the relationship

$$ p(r, z, t) = \mu(r)\, \phi(z)\, e^{-j\omega_0 t}. \tag{10.19} $$

Assuming that the source range is known ($r = r_s$) and transforming to the temporal
frequency domain, we have that

$$ p(z, \omega_0) = H_0(\kappa_r r_s)\, \phi(z)\, \delta(\omega - \omega_0) \tag{10.20} $$

with $H_0(\kappa_r r_s)$ the zeroth-order Hankel function. If we sample the spatial
pressure-field with a vertical line array of hydrophones as before, then we have

$$ p(z_\ell, \omega_0) = [\beta_1(r_s, z_s, \omega_0)\ \ 0 \mid \cdots \mid \beta_M(r_s, z_s, \omega_0)\ \ 0]\; \Phi(z_\ell, \omega_0) \tag{10.21} $$

with $\beta_m(r_s, z_s, \omega_0) = H_0(\kappa_r(m) r_s)\, \phi_m(z_s, \omega_0)$ or
simply

$$ p(z_\ell, \omega_0) = \beta^T(r_s, z_s, \omega_0)\, \Phi(z_\ell, \omega_0) \tag{10.22} $$
In terms of these models, it has been shown [29] that the broadband state-space
propagator can be expressed as

$$ \frac{d}{dz}\Phi(z, \omega) = A(z, \omega)\, \Phi(z, \omega) $$
$$ p(z_\ell, \omega) = C^T(r_s, z_s, \omega)\, \Phi(z_\ell, \omega) \tag{10.23} $$

where $\Phi(z, \omega) \in R^{2M \times 1}$, $A(z, \omega) \in C^{2M \times 2M}$,
$C^T(z, \omega) \in R^{1 \times 2M}$ with $M = \sum_{q} M_q$. Here we have implicitly
assumed an incoherent sum over the set of temporal frequencies for our pressure-field
measurement model. Note we use the parameter “$\omega$” to signify the entire set of
discrete temporal frequencies, $\{\omega_q\}$, $q = 1, \ldots, Q$.

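The internal structure of this overall processor admits the following decomposition:

$$
\frac{d}{dz}\Phi(z, \omega) =
\begin{bmatrix}
A(z, \omega_1) & 0 & \cdots & 0 \\
0 & A(z, \omega_2) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & A(z, \omega_Q)
\end{bmatrix}
\begin{bmatrix}
\Phi(z, \omega_1) \\ \Phi(z, \omega_2) \\ \vdots \\ \Phi(z, \omega_Q)
\end{bmatrix}
\tag{10.24}
$$

and $A(z, \omega_q) = \mathrm{diag}[A_1(z, \omega_q) \cdots A_{M_q}(z, \omega_q)] \in R^{2M_q \times 2M_q}$ and

$$
A_m(z, \omega_q) =
\begin{bmatrix}
0 & 1 \\
-\kappa_z^2(m, q) & 0
\end{bmatrix},
\tag{10.25}
$$

with the incoherent pressure-field sensor measurement model given by (see Fig. 10.11)

$$
p(z_\ell, \omega) = [C^T(r_s, z_s, \omega_1) \cdots C^T(r_s, z_s, \omega_Q)]
\begin{bmatrix}
\Phi(z_\ell, \omega_1) \\ \vdots \\ \Phi(z_\ell, \omega_Q)
\end{bmatrix}
\tag{10.26}
$$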

This deterministic model can be extended to a stochastic Gauss-Markov representation
[11] given by

$$ \frac{d}{dz}\Phi(z, \omega) = A(z, \omega)\, \Phi(z, \omega) + w(z, \omega) $$
$$ p(z_\ell, \omega) = C^T(z, \omega)\, \Phi(z_\ell, \omega) + v(z, \omega) \tag{10.27} $$

where $w$, $v$ are additive, zero-mean Gaussian noise sources with respective spectral
covariance matrices $R_{ww}(z, \omega)$ and $R_{vv}(z, \omega)$. Next we investigate
the internal structure of the broadband Bayesian processor.
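
Because the propagator of Eqs. 10.24 and 10.25 is block diagonal, each mode at each
frequency can be marched down the water column independently using its own
$2 \times 2$ block. The sketch below illustrates this for a single mode under a
simplifying assumption that is not required by the text: $\kappa_z$ is held constant
over each depth step, so that the transition matrix $\exp(A_m \Delta z)$ is available
in closed form. The function names are hypothetical and the noise term $w$ is omitted.

```python
import math


def mode_transition(kappa_z, dz):
    """Closed-form exp(A_m * dz) for A_m = [[0, 1], [-kappa_z**2, 0]].

    Valid when kappa_z is constant over the depth step dz (an assumption here).
    """
    c, s = math.cos(kappa_z * dz), math.sin(kappa_z * dz)
    return [[c, s / kappa_z], [-kappa_z * s, c]]


def propagate_mode(kappa_z, dz, n_steps, phi0=0.0, dphi0=1.0):
    """March the state [phi, dphi/dz] in depth; return phi at each depth sample."""
    t = mode_transition(kappa_z, dz)
    phi, dphi = phi0, dphi0
    out = [phi]
    for _ in range(n_steps):
        phi, dphi = (t[0][0] * phi + t[0][1] * dphi,
                     t[1][0] * phi + t[1][1] * dphi)
        out.append(phi)
    return out


# Usage: one mode with kappa_z = pi/100 rad/m sampled every 1 m over 100 m
phi = propagate_mode(math.pi / 100.0, 1.0, 100)
```

With the initial state $[0, 1]$ the propagated mode is $\sin(\kappa_z z)/\kappa_z$. A
full broadband propagator would repeat this loop for every pair $(m, q)$, using the
depth-dependent $\kappa_z(m, q)$ from the dispersion relation.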
10.2.3 Broadband Bayesian Processing

The broadband pressure-field can be characterized by the function, $p(z, \omega)$. If
we assume that the field is measured by a vertical array, then at each sensor the
acquired time series is given by $p(z_\ell, t)$. Clearly, this sensor measurement
contains all of the information about the source, both temporally and spatially, but
due to the complexity of this received data coupled with the noise, the required signal
processing is quite a challenging problem. If we take the Fourier transform in the
temporal domain, then the broadband source can be thought of as being viewed through a
bank of narrowband temporal filters, that is, the broadband pressure-field surface or
image is then given by: $p(z, \omega) \to p(z_\ell, \omega_q)$; for
$\ell = 1, \ldots, L$, and $q = 1, \ldots, Q$. Once this surface is constructed, it is
possible to infer other useful information about the dynamics of the ocean as well as
the source. If we assume further that we have a shallow water environment characterized
by trapped wave propagation, then we may represent the underlying Green's function or
channel impulse response in terms of a normal-mode model. In this case it may also be
of interest to observe the surface created by the various broadband modal functions,
that is, as a function of temporal frequency, $\phi_m(z_\ell, \omega_q)$; for
$m = 1, \ldots, M_q$. It is clear that as we decompose this complex temporal
pressure-field measurement into these narrow frequency bands, each band will contain
oceanographical and source information. Thus, the problem that we address first here is
that of receiving a set of temporal noisy broadband pressure-field measurements and
developing the enhancement necessary to construct the surfaces created by both the
broadband pressure-field and modal functions.

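With this problem in mind, we now recast our incoherent measurement model of
Eq. 10.26 into that of a broadband system obtained by “stacking” all of the narrowband
pressure-field measurements, rather than performing the coherent or incoherent
addition. The resulting vector broadband pressure-field measurement model at the
$\ell$th vertical sensor is now given by

$$ \mathbf{p}(z_\ell, \omega) = C^T(r_s, z_s, \omega)\, \Phi(z_\ell, \omega) + \mathbf{v}(z_\ell, \omega) \tag{10.28} $$

where $\mathbf{p}, \mathbf{v} \in C^{Q \times 1}$, $C^T \in R^{Q \times 2M}$, and
$\Phi \in R^{2M \times 1}$. Expanding this model over the set of discrete temporal
frequencies $\{\omega_q\}$, $q = 1, \ldots, Q$, we have

$$
\begin{bmatrix}
p(z_\ell, \omega_1) \\ \vdots \\ p(z_\ell, \omega_Q)
\end{bmatrix}
=
\begin{bmatrix}
C^T(z_\ell, \omega_1) & \cdots & 0 \\
\vdots & \ddots & \vdots \\
0 & \cdots & C^T(z_\ell, \omega_Q)
\end{bmatrix}
\begin{bmatrix}
\Phi(z_\ell, \omega_1) \\ \vdots \\ \Phi(z_\ell, \omega_Q)
\end{bmatrix}
+
\begin{bmatrix}
v(z_\ell, \omega_1) \\ \vdots \\ v(z_\ell, \omega_Q)
\end{bmatrix}
\tag{10.29}
$$

We assume that the corresponding measurement spectral covariance is given by:

$$ R_{vv}(z_\ell, \omega) = \mathrm{diag}[R_{vv}(z_\ell, \omega_1) \cdots R_{vv}(z_\ell, \omega_Q)] \tag{10.30} $$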

In order to obtain the optimal (minimum error variance) estimator, we cast our
problem into a probabilistic framework under these Gauss-Markov assumptions;
therefore, our sequence of pressure-field measurements at each sensor is Fourier
transformed to yield a discrete set of frequencies $\{\omega_q\}$, $q = 1, \ldots, Q$
and the set of broadband sensor measurements, $\{\mathbf{p}(z_\ell, \omega)\}$,
$\ell = 1, \ldots, L$; $\mathbf{p} \in C^{Q \times 1}$. Note that the window length of
the Fourier transform is determined by the temporal correlation time of the measurement
(source) to assure the independence of the frequency samples. For our
pressure-field/modal function estimation problem, we define the underlying broadband
pressure-field/modal function enhancement problem as:

GIVEN a set of noisy broadband pressure-field measurements,
$\{\mathbf{p}(z_\ell, \omega)\}$, $\ell = 1, \ldots, L$, and the underlying
Gauss-Markov model (previous section), FIND the best (minimum error variance) estimate
of the broadband modal functions, $\hat{\Phi}(z_\ell, \omega)$, and pressure-field,
$\hat{\mathbf{p}}(z_\ell, \omega)$.

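The Bayesian solution to this problem can be obtained by maximizing the a posteriori
density. Define the set of broadband pressure-field measurements as:
$P_\ell = \{\mathbf{p}(z_1, \omega), \ldots, \mathbf{p}(z_\ell, \omega)\}$;
$\mathbf{p} \in C^{Q \times 1}$, then the maximum a posteriori (MAP) estimator of the
modal functions must maximize the posterior density given by

$$ \Pr(\Phi(z_\ell, \omega) \mid P_\ell) = \frac{\Pr(\Phi(z_\ell, \omega), P_\ell)}{\Pr(P_\ell)} \tag{10.31} $$

but it follows directly from Bayes' rule that we have

$$ \Pr(\Phi(z_\ell, \omega) \mid P_\ell) = \frac{\Pr(\mathbf{p}(z_\ell, \omega) \mid \Phi(z_\ell, \omega), P_{\ell-1}) \times \Pr(\Phi(z_\ell, \omega) \mid P_{\ell-1})}{\Pr(\mathbf{p}(z_\ell, \omega) \mid P_{\ell-1})} \tag{10.32} $$

Under Gauss-Markov assumptions, we have that

$$ \Pr(\mathbf{p}(z_\ell, \omega) \mid P_{\ell-1}) \sim \mathcal{N}\big(C(r_s, z_s, \omega)\, \hat{\Phi}(z_{\ell|\ell-1}, \omega),\ C(r_s, z_s, \omega)\, \tilde{P}(z_{\ell|\ell-1}, \omega)\, C^T(r_s, z_s, \omega) + R_{vv}(z_{\ell-1}, \omega)\big) $$
$$ \Pr(\mathbf{p}(z_\ell, \omega) \mid \Phi(z_\ell, \omega), P_{\ell-1}) \sim \mathcal{N}\big(C(r_s, z_s, \omega)\, \Phi(z_\ell, \omega),\ R_{vv}(z_{\ell-1}, \omega)\big) $$
$$ \Pr(\Phi(z_\ell, \omega) \mid P_{\ell-1}) \sim \mathcal{N}\big(\hat{\Phi}(z_{\ell|\ell-1}, \omega),\ \tilde{P}(z_{\ell|\ell-1}, \omega)\big) \tag{10.33} $$

where the modal estimation error covariance is given by [29]

$$ \tilde{P}(z_{\ell|\ell-1}, \omega) = A(z_{\ell-1}, \omega)\, \tilde{P}(z_{\ell-1|\ell-1}, \omega)\, A^T(z_{\ell-1}, \omega) + R_{ww}(z_{\ell-1}, \omega) \tag{10.34} $$

Here the notation
$\hat{\Phi}(z_{\ell|\ell-1}, \omega) = E\{\Phi(z_\ell, \omega) \mid P_{\ell-1}\}$ is the
conditional mean, that is, the “best” (minimum variance) estimate at depth $z_\ell$
based on the previous sensor measurements up to the depth $z_{\ell-1}$.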

Now substituting these distributions into the a posteriori density and performing the
necessary manipulations [17], we obtain the desired relations. That is, by maximizing
the a posteriori density or equivalently its logarithm, we obtain the MAP equation

$$ \nabla_{\Phi} \ln \Pr(\Phi(z_\ell, \omega) \mid P_\ell)\Big|_{\Phi = \hat{\Phi}_{MAP}} = 0 \tag{10.35} $$

Differentiating the posterior density, setting the result to zero and solving for
$\Phi(\cdot)$ gives the desired MAP estimator [29]

$$ \hat{\Phi}(z_{\ell|\ell}, \omega) = \hat{\Phi}(z_{\ell|\ell-1}, \omega) + K(z_\ell, \omega)\, e(z_\ell, \omega) \tag{10.36} $$

where $\hat{\Phi}(z_{\ell|\ell}, \omega) \in R^{2M \times 1}$ is the corrected
estimate, $K(z_\ell, \omega) \in C^{2M \times Q}$ is the gain, and
$e(z_\ell, \omega) \in C^{Q \times 1}$ is the innovation, as shown below in the linear
space-varying broadband Kalman filter algorithm. The overall structure of the estimator
can be seen by expanding the gain matrix over the set of discrete frequencies for
$\omega \to \omega_q$, one for each of the $Q$ columns, to give

$$ \hat{\Phi}(z_{\ell|\ell}, \omega) = \hat{\Phi}(z_{\ell|\ell-1}, \omega) + \sum_{q=1}^{Q} K(z_\ell, \omega_q)\, e(z_\ell, \omega_q) \tag{10.37} $$

where $\hat{\Phi}(z, \omega) \in R^{2M \times 1}$,
$K(z_\ell, \omega_q) \in C^{2M \times 1}$, and $e(z_\ell, \omega_q) \in C^{1 \times 1}$.
Now let us rewrite this equation in a slightly different manner by expanding over the
set of discrete frequencies and expressing the gain in terms of $2M_q \times Q$ block
rows $K_q(z_\ell, \omega)$, which are decomposed further into individual column vectors
$K_{qn}(z_\ell, \omega_n) \in C^{2M_q \times 1}$, to obtain the narrowband recursion for
the corrected estimate as

$$ \hat{\Phi}(z_{\ell|\ell}, \omega_q) = \hat{\Phi}(z_{\ell|\ell-1}, \omega_q) + \sum_{n=1}^{Q} K_{qn}(z_\ell, \omega_n)\, e(z_\ell, \omega_n) \tag{10.38} $$

with $\hat{\Phi}(z_{\ell|\ell}, \omega_q) \in C^{2M_q \times 1}$, the gain
$K_{qn}(z_\ell, \omega_n) \in C^{2M_q \times 1}$, and
$e(z_\ell, \omega_n) \in C^{1 \times 1}$, showing how each narrowband frequency line
$\omega_q$ can be combined to form the optimal MAP estimate.

These relations suggest that an efficient parallel, but suboptimal, approach to
implementing this broadband processor might be achieved by constructing a “local”
narrowband processor for each spectral line $\omega_q$ and then combining their outputs
to obtain the final broadband estimate, that is,

$$ \hat{\Phi}(z_{\ell|\ell}, \omega_q) = \hat{\Phi}(z_{\ell|\ell-1}, \omega_q) + K_{qq}(z_\ell, \omega_q)\, e(z_\ell, \omega_q) \tag{10.39} $$

where we have discarded the other $2M_q \times 1$ submatrices,
$K_{qn}(z_\ell, \omega_n) = 0$, $n \ne q$, to give

$$ K_{qq}(z_\ell, \omega_q) = \frac{\tilde{P}(z_{\ell|\ell-1}, \omega_q)\, C^T(r_s, z_s, \omega_q)}{R_{ee}(z_\ell, \omega_q)} $$

The structure of the suboptimal broadband implementation of the BP is illustrated
in Fig. 10.12. Thus the optimal algorithm will consist of a bank of narrowband
Bayesian predictors combined during the update (correction) phase of the algorithm to
create the broadband MAP estimator. The algorithm then proceeds for each $\omega_q$,
$q = 1, \ldots, Q$ as:

FIGURE 10.12 Structure of the broadband Bayesian processor using narrowband
modal/pressure-field decomposition.


Prediction:

$$ \frac{d}{dz}\hat{\Phi}(z, \omega_q) = A(z, \omega_q)\, \hat{\Phi}(z, \omega_q) $$
$$ \hat{p}(z_\ell, \omega_q) = C^T(r_s, z_s, \omega_q)\, \hat{\Phi}(z_{\ell|\ell-1}, \omega_q) $$

Innovation:

$$ e(z_\ell, \omega_q) = p(z_\ell, \omega_q) - \hat{p}(z_\ell, \omega_q) $$

Correction:

$$ \hat{\Phi}(z_{\ell|\ell}, \omega_q) = \hat{\Phi}(z_{\ell|\ell-1}, \omega_q) + \sum_{n=1}^{Q} K_{qn}(z_\ell, \omega_n)\, e(z_\ell, \omega_n) $$

Gain:

$$ K(z_\ell, \omega) = \tilde{P}(z_{\ell|\ell-1}, \omega)\, C^T(r_s, z_s, \omega)\, R_{ee}^{-1}(z_\ell, \omega) \tag{10.40} $$

FIGURE 10.13 Shallow ocean environment problem: channel (100 m) with broadband
(50 Hz) source located at range, $r_s = 10$ km and depth, $z_s = 50$ m.

This completes the development of the broadband Bayesian processor. Next we
consider its application.
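
The prediction, innovation, gain and correction steps of the suboptimal “local”
narrowband processor (Eq. 10.39) can be sketched for the simplest possible case: one
mode at one frequency, real-valued data, and a scalar measurement
$p = \beta\phi + v$. These restrictions, the constant $\kappa_z$ over each depth step,
the diagonal process noise, and the function name `narrowband_bp` are assumptions made
for illustration; the processor described in the text is complex-valued and
multichannel.

```python
import math


def narrowband_bp(meas, kappa_z, dz, beta, r_ww, r_vv, x0, p0):
    """Kalman recursion over depth for one mode at one frequency.

    State x = [phi, dphi/dz]; measurement p = beta * phi + v.
    r_ww: process noise variance added to both states (diagonal, assumed)
    r_vv: measurement noise variance
    x0, p0: state and 2x2 covariance at the depth just above the first sensor
    Returns (corrected phi estimates, innovations).
    """
    c, s = math.cos(kappa_z * dz), math.sin(kappa_z * dz)
    a = [[c, s / kappa_z], [-kappa_z * s, c]]  # exact step for constant kappa_z
    x = list(x0)
    p = [row[:] for row in p0]
    phi_hat, innov = [], []
    for y in meas:
        # Prediction: x <- A x, P <- A P A' + R_ww
        x = [a[0][0] * x[0] + a[0][1] * x[1], a[1][0] * x[0] + a[1][1] * x[1]]
        ap = [[sum(a[i][k] * p[k][j] for k in range(2)) for j in range(2)]
              for i in range(2)]
        p = [[sum(ap[i][k] * a[j][k] for k in range(2)) + (r_ww if i == j else 0.0)
              for j in range(2)] for i in range(2)]
        # Innovation and its variance
        e = y - beta * x[0]
        r_ee = beta * p[0][0] * beta + r_vv
        # Gain: K = P c / R_ee with c = [beta, 0]
        k = [p[0][0] * beta / r_ee, p[1][0] * beta / r_ee]
        # Correction: x <- x + K e, P <- (I - K c') P
        x = [x[0] + k[0] * e, x[1] + k[1] * e]
        p = [[p[i][j] - k[i] * beta * p[0][j] for j in range(2)] for i in range(2)]
        phi_hat.append(x[0])
        innov.append(e)
    return phi_hat, innov
```

A bank of such recursions, one per spectral line with its own modal blocks, gives the
parallel structure of Fig. 10.12; the optimal processor differs in that the correction
at each frequency also uses the innovations from the other frequencies through the
cross gains $K_{qn}$.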


10.2.4 Broadband BSP Design 

In this subsection we discuss the application of the Bayesian processor to data
synthesized by a broadband normal-mode model using the state-space forward propagator
and the underlying Gauss-Markov representation.

Let us consider a basic shallow water channel depicted in Fig. 10.13. We assume
a flat bottom, range independent three layer environment with a channel depth of
100 m, a sediment depth of 2.5 m and a subbottom. A vertical line array of 100 sensors
with spacing of $\Delta z = 1$ m spans the entire water column and a broadband source
of unit amplitude and 50 Hz bandwidth ranging from 50-100 Hz in 10 Hz increments is
located at a depth of 50 m and a range of 10 km from the array. The sound speed profile
in the water column and the sediment are sketched in the figure and specified along
with the other problem parameters. SNAP, a normal-mode propagation simulator [35],
is applied to solve this shallow water boundary value problem and executed over the set
of discrete temporal source frequencies; the results at each narrowband frequency are
shown in [29]. We note that as the temporal frequency increases, the number of modes
increases, thereby increasing the corresponding order of the state-space. Details of
the problem parameters are given in [29].

The parameters obtained from SNAP are now used to construct the broadband
state-space and measurement models of the previous subsection. Here we use the set
of horizontal wave numbers, $\{\kappa_r(m, q)\}$, $m = 1, \ldots, M_q$;
$q = 1, \ldots, Q$, and sound speed, $\{c(z)\}$, to implement the state models along
with the corresponding modal function values, $\{\phi_m(z_s, \omega_q)\}$, as well as
the Hankel functions, $\{H_0(\kappa_r(m, q) r_s)\}$, to construct the measurement
models.

The final set of parameters for our simulation are the modal and measurement noise
covariance matrices required by the Gauss-Markov model (see [29] for more details).

It is important to realize that the state-space “forward” propagators do not offer
an alternative solution to the Helmholtz equation (not to be confused with a marching
method), but rather use the parameters from the boundary value solution to obtain a
set of initial conditions/parameters for the propagator construction. Even the adaptive
Bayesian processors still utilize the boundary value solutions to “initialize” the
processing [28].

With this information in hand, the Gauss-Markov simulation was performed at
$SNR_{in} = 25$ dB (noise free) and $SNR_{out} = -30$ dB. The “true” pressure-field
surface is shown in Fig. 10.14 along with the corresponding noisy surface, both
produced as outputs of a set of narrowband DFT filters at each $\omega_q$. We also show
the true pressure-field functions which are expected to be extracted by the optimal
processor along with the modal function estimates.

FIGURE 10.14 Synthesized broadband pressure-field surface: (a) True pressure-field.
(b) Noisy (-30 dB) pressure-field. (c) Narrowband DFT filter outputs of true field.

The optimal Bayesian or minimum variance processor was designed using the
identical set of parameters used in the Gauss-Markov simulation, thereby eliminating
any “mismatch” between model and environment. We can consider this simulation as
a bound on the best one could hope to achieve, since these are in fact the minimum
variance estimates satisfying the Cramer-Rao lower bound. In minimum variance
estimation it is important to realize the overall design philosophy. First, the key
issue is that a necessary and sufficient condition for optimality is that the
innovation sequences (difference between measured and predicted pressure-fields) are
zero-mean and uncorrelated (white). Thus, actual minimum variance designs are not
considered “tuned” unless this condition is satisfied; therefore, the free parameters
in the processor (usually initial conditions and process/measurement noise
vectors/matrices) are adjusted (if possible) until this condition is achieved. Once it
is satisfied, then and only then can the state (modal function) and measurement
(pressure-field) estimates along with their associated covariances be considered
viable.

Thus, overall performance of the processor can be assessed by analyzing the
statistical properties of the innovations, which is essentially the approach we take in
this feasibility test for the broadband processor design on synthesized data. There are
other tests that can be used with real data to check the consistency of the processor
(see Chapter 5 for more details).


10.2.5 Results 

We use SSPACK_PC [30], a Bayesian processing toolbox available in MATLAB [31] 
to design our broadband MBP. The results of the minimum variance design are shown 
in Fig. 10.15, where we see the enhanced pressure-field and the corresponding
innovations sequences at each discrete temporal frequency as a function of depth. Each
of the innovations sequences tested zero-mean and white with the following test 
results: 50 Hz: (2% out; 0.08 < 7.5); 60 Hz: (2% out; 0.98 < 2.5); 70 Hz: (3.9% 
out; 0.66 < 2.6); 80 Hz: (2% out; 0.70 < 3.8); 90 Hz: (0% out; 0.70 < 1.9); 100 Hz: 
(2% out; 1.90 < 3.4). The corresponding WSSR statistic lies below the threshold in 
Fig. 10.16. Thus we have (as expected) achieved a minimum variance “broadband” 
design. Note that the enhanced pressure-field estimate at each temporal frequency,
$\omega_q \in \{50, 60, 70, 80, 90, 100\}$ Hz, is governed by the Gauss-Markov model of the
previous subsection. The corresponding modal functions were then extracted from the
noisy pressure-field producing viable estimates. The estimated modal functions
correspond to the two (2) modes at 50 Hz and three (3) at 60 Hz. The other estimated
modes (from the noisy data) are at 70 Hz (3 modes), 80 Hz (4 modes), 90 Hz (4 modes)
and 100 Hz (5 modes). Again note that the modal estimates $\hat{\phi}(z_\ell, \omega_q)$ along with the
measurement model at each temporal frequency, $C^T(r_s, z_s, \omega_q)$, are used to construct
the pressure-field, $\hat{p}(z_\ell, \omega_q)$, at each temporal frequency solving the enhancement
problem. 
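The WSSR (weighted sum-squared residual) statistic used for the vector whiteness test can be sketched as follows. This is a minimal illustration, assuming a diagonal innovation covariance and the common Gaussian approximation to the chi-squared threshold, $\tau = N_e W + 1.96\sqrt{2 N_e W}$; the function names and the choice of six channels (one per temporal frequency) are assumptions.

```python
import math
import random

def wssr(innovations, variances, window):
    """Sliding-window sum of normalized squared innovations.

    innovations: list of samples, each a list with one entry per channel
    variances:   per-channel innovation variances (diagonal covariance assumed)
    """
    q = [sum(e * e / v for e, v in zip(sample, variances)) for sample in innovations]
    return [sum(q[k - window + 1:k + 1]) for k in range(window - 1, len(q))]

def wssr_threshold(window, n_channels, z=1.96):
    """Approximate 95% threshold for a chi-squared variable with window*n_channels dof."""
    dof = window * n_channels
    return dof + z * math.sqrt(2.0 * dof)

# Synthetic unit-variance white innovations on six channels.
rng = random.Random(0)
e = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(200)]
rho = wssr(e, [1.0] * 6, window=35)
tau = wssr_threshold(35, 6)
print("threshold =", round(tau, 2), " max WSSR =", round(max(rho), 2))
```

Under these assumptions a 35-sample window with six channels gives a threshold of approximately 250, which agrees with the value quoted for Fig. 10.16.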
FIGURE 10.15 Broadband Bayesian processor design: (a) Enhanced pressure-field
estimates, (b) Innovation sequences.

FIGURE 10.16 Innovation sequence whiteness testing: WSSR statistic (Window = 35
samples; Threshold = 250).
This completes the application of the broadband Bayesian processor designed to
enhance the pressure-field surface and extract the corresponding modal functions. 

10.3 BAYESIAN PROCESSING FOR BIOTHREATS 

The design of a “smart” physics-based processor for microcantilever sensor arrays to 
detect various target species in solution based on the deflections of a functionalized 
array is discussed. A proof-of-concept design is demonstrated and shown to perform 
quite well on experimental data. 

10.3.1 Background 

Smart sensors with embedded processors offer unique advantages for applications 
that must gather large amounts of data and continuously monitor evolving conditions 
for potential changes (e.g., machine condition monitoring) or potential threats (e.g., 
biological, chemical, nuclear). Unfortunately, the usual processing techniques such
as nonparametric methods like wavelets or parametric methods like autoregressive
moving average (ARMA) models do not capture the true essence of the problem
physics required to extract the desired information, detect the change or monitor
the environment for threats. The underlying physical phenomenology governing the
propagation physics is usually quite complex, governed by nonlinearities typically
characterized by nonlinear differential/difference equations. Coupling the resulting
nonlinear processor to the sensor performing the measurement has not been considered 
a realistic possibility until now with the evolution of high-speed microcomputer chips 
that can easily be incorporated into the sensor design. We consider the design of 
an algorithm coupled to a microelectromechanical sensor (MEMS) to estimate the 
presence of critical materials or chemicals in solution. 

Microcantilevers are powerful transducers for sensing inorganic, organic and 
biological molecules, since they readily bend or deflect in the presence of a very 
small number of target molecules (nanomolar to femtomolar concentrations) [36] as 
shown in Fig. 10.17. The number of potential target chemicals is large, ranging from 
DNA [37] to explosives [38], implying that these sensors may be useful in defense, 
medicine, drug discovery, and environmental monitoring. Microcantilevers are
capable of recognizing antibodies [39] and nerve agent products (hydrofluoric acid) in
solution [40]. However, a major limitation of these sensors is that their signal-to-noise ratio
(SNR) is low in many operational environments of interest. Therefore, we discuss the
design of a “smart sensor” combining the array with a physics-based processor
to minimize its inherent limitations and maximize the output SNR for enhancement.

We investigate the physics-based Bayesian approach [17] to develop a multichannel
processor evolving into a smart sensor for this application. This approach is essentially 
incorporating mathematical models of both physical phenomenology (chemistry/flow 
dynamics) and the measurement process (cantilever array including noise) into the 
processor to extract the desired information. In this way the resulting Bayesian signal 
processor (BSP) enables the interpretation of results directly in terms of the problem 
physics. 
FIGURE 10.17 Micromachined cantilever array: (a) Eight (8)-element lever array,
(b) Lever deflection.

We discuss the application of physics-based signal processing to micromachined
cantilever measurement arrays to estimate the critical materials in solution. We briefly
present the underlying physical phenomenology and reduce it to a simple model for
processor development. Unknown parameters in this model are “fit” from independent
experimental data. Once these parameters are estimated, we use minimum error
variance techniques for the BSP design [17]. We then apply the resulting processor
to experimental data demonstrating the overall enhancement that would lead to an
eventual “smart sensor” design. The resulting processor is based on nonlinear
evolution equations leading to an extended Bayesian processor (XBP) or classically, the
extended Kalman filter (EKF), that can be implemented in an on-line manner yielding
the enhanced data as its output.

10.3.1.1 Microcantilever Sensors The dynamics of the fluids flowing over the
cantilever array of Fig. 10.17 is influenced by two major factors: temperature and flow. 
The temperature is dependent on many variables and the dynamics are relatively slow, 
creating a disturbance to the cantilever sensor system. Flow in the medium associated 
with the array induces a stress along with the chemical forces created by the molecules 
bonding with the functionalized levers leading to the measured deflection. 

Micromachined cantilevers can function as detection devices when one side is 
fabricated to be chemically distinct from the other, as shown in Fig. 10.17b.
Functionalization can be accomplished, for example, by evaporating a thin (10’s of nm)
film of metal such as gold on the top of the chip, then immersing the cantilever chip in
a “probe” chemical that will bind preferentially to the thin gold film. The lever acts as
a sensor when it is exposed to a second “target” chemical that reacts with the probe,
since the reaction causes a free energy change that induces stress at the cantilever
surface. Differential surface stress, $\Delta\sigma$, in turn, induces a deflection of the cantilever
that can be measured optically or electronically. In this subsection, we describe results
of experiments with gold-coated cantilevers exposed to 2-mercaptoethanol, a small 
sulfur-terminated molecule with high affinity for gold. 

When the microcantilever is immersed in a fluid and it has been functionalized 
to attract the target molecules, the changes in surface stress can be predicted as a 
function of surface loading. The evolution dynamics of this chemical interaction is 
captured by the well-known Langmuir kinetics in terms of a set of ordinary nonlinear 
differential equations. Rather than propagate these equations directly, we choose 
to use their solution in our processor that accounts for the adsorption-desorption 
kinetics. We developed an approximation to the Langmuir evolution equations based 
on a stirred tank reactor to estimate the target concentration as a function of time 
under continuous flow conditions. Experimentally, the applied chemical input signal 
is a constant concentration initiated by a step (function) increase at time, $t_{ON}$, and
terminated at time, $t_{OFF}$ [41].

Based on this representation of the process evolution physics, the dynamic
(normalized) surface concentration of the interacting molecules on the surface of the
cantilever, $\Gamma(t)$, is given by the relations

$$
\Gamma(t) =
\begin{cases}
0, & t < t_{ON} \\[4pt]
\left(\dfrac{c(t)}{c(t) + k_a/k_d}\right)\left\{1 - \exp\left[-(k_a c(t) + k_d)(t - t_{ON})\right]\right\}, & t_{ON} \le t \le t_{OFF} \\[4pt]
\dfrac{1}{\sqrt{2 k_d (t - t_{OFF})}}, & t > t_{OFF}
\end{cases}
\qquad (10.41)
$$


where $k_a$, $k_d$ and $c(t)$ are the respective adsorption rate constant ($M^{-1}\,s^{-1}$),
desorption rate constant ($cm^{-2}\,M^{-1}\,s^{-1}$) and bulk concentration of the target molecules in
solution (moles/liter) with $\Gamma_{max}$, the maximum surface concentration of the species
of interest ($cm^{-2}\,M^{-1}$).
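To illustrate the adsorption phase of the kinetics, the sketch below integrates the textbook Langmuir rate equation, $d\Gamma/dt = k_a c\,(1 - \Gamma) - k_d\,\Gamma$ for normalized coverage, and compares it with its closed-form step response. This standard form, whose saturation factor is $k_a c/(k_a c + k_d)$, is an assumption of the sketch and is not a transcription of Eq. 10.41; the rate constants and concentration are hypothetical values chosen only for illustration, and the desorption branch is not reproduced.

```python
import math

def coverage_closed_form(t, c, k_a, k_d, t_on=0.0):
    """Normalized coverage for a concentration step of height c applied at t_on."""
    if t < t_on:
        return 0.0
    rate = k_a * c + k_d
    return (k_a * c / rate) * (1.0 - math.exp(-rate * (t - t_on)))

def coverage_euler(c_of_t, k_a, k_d, t_end, dt=1e-3):
    """Integrate d(theta)/dt = k_a*c*(1 - theta) - k_d*theta with forward Euler."""
    theta, t, history = 0.0, 0.0, []
    for _ in range(int(round(t_end / dt))):
        history.append((t, theta))
        theta += dt * (k_a * c_of_t(t) * (1.0 - theta) - k_d * theta)
        t += dt
    history.append((t, theta))
    return history

# Hypothetical parameters: k_a = 2.0, k_d = 0.5, unit concentration step at t = 0.
history = coverage_euler(lambda t: 1.0, k_a=2.0, k_d=0.5, t_end=2.0)
t_final, theta_final = history[-1]
print("numerical:  ", theta_final)
print("closed form:", coverage_closed_form(2.0, 1.0, 2.0, 0.5))
```

Using the closed-form solution in the processor, as the text describes, avoids propagating the differential equation at every time step.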

The total free energy change of the cantilever surface, $\Delta G$, is related to the surface
stress difference, $\Delta\sigma$, between the top and bottom sides of the cantilever by

$$\Delta\sigma(t) = \Delta G(t)\,\Gamma(t)/M_A \qquad (10.42)$$

where $\Delta G$ has units of J/mole and is the change in the sum of all of the contributions
to the free energy of the surface of the cantilever, and $M_A$ is Avogadro’s number
(constant). The differential surface stress in the cantilever induces a chemically induced
deflection, $\Delta z^c(t)$, using a variant of Stoney’s equation [41], which implies that the
deflection of the cantilever is directly proportional to the difference in surface stress
(signal) on the cantilever surface relating this stress difference to the surface coverage
and free energy of adsorption, that is,

$$\Delta z^c(t) = \beta\,\Delta\sigma(t) \quad \text{for } \beta = \frac{3\ell^2(1-\nu)}{E\delta^2} \qquad (10.43)$$

where $E$ is the Young’s modulus, $\nu$ is the Poisson’s ratio, and $\ell$ and $\delta$ are the cantilever
length and thickness, respectively. This model can be used to predict changes in
surface stress as a function of surface loading.
The measurement model is more complicated, since it is the superposition of both
the chemical and temperature deflection phenomena

$$y_\ell(t) = \Delta z^c_\ell(t) + \Delta z^T(t) \quad \text{for } \ell = 1, \ldots, L \qquad (10.44)$$

where $\Delta z^c_\ell(t)$ is the chemical deflection, different for different cantilevers, and $\Delta z^T(t)$
is the thermal deflection, assumed to be the same for all cantilevers. Since we have
approximated the physics developing the well-founded formulation of Eq. 10.41,
we know that there is uncertainty in both the measurements (noise) and the model
parameters. Therefore, we cast the problem into a Gauss-Markov (GM) state-space
framework [17] representing these uncertainties as additive Gaussian processes,
that is,

$$
\begin{aligned}
\Delta G(t) &= \Delta G(t-1) + w(t-1) && \text{[state]} \\
y_\ell(t) &= \beta_\ell\,\Gamma(t; \Theta)\,\Delta G(t) + \Delta z^T(t) + v_\ell(t) && \text{[measurements]}
\end{aligned}
\qquad (10.45)
$$

where $\ell = 1, \ldots, L$; $\Gamma(t; \Theta)$ is given by Eq. 10.41 with unknown model parameters
defined by the vector, $\Theta = [k_a \; k_d \; \Gamma_{max}]$, and free energy (state) modeled as a
random walk ($\Delta\dot{G}(t) = 0$). This representation, therefore, creates the foundation for our
physics-based processor design. The process uncertainty is modeled by $w$ and the
corresponding measurement uncertainty as $v$, both zero-mean Gaussian with
respective covariances, $R_{ww}$ and $R_{vv}$. With this representation in mind, the cantilever signal
enhancement problem is defined as:

GIVEN a set of noisy deflection measurements $\{y_\ell(t)\}$ with known bulk concentration
inputs, $\{c(t)\}$, and unknown parameters $\Theta$, FIND the best (minimum error variance)
estimate of the deflection, $\hat{y}(t|t-1)$, that is, the conditional mean at $t$ based on the
data up to time $t - 1$.

The design of the processor for this problem is illustrated in Fig. 10.18 and Figs. 1.4
and 1.5. After the cantilever physics model is developed, it is used (1) to extract the required
parameters using a physics-based parameter estimator; (2) to synthesize “data” for the
initial processor designs; and (3) to enhance the noisy measurements by being
incorporated into the final BSP structure. In the figure, we see that the complex mathematical
model of Eq. 10.41 (dashed box) is used to perform the physics-based parameter
estimation using independent experimental data to extract the required parameters
(adsorption/desorption rate constants and maximum concentration) as well as
initially simulate data for BP design studies, once these parameters are extracted. The
actual experimental data replaces the synthesized data and is used to validate the processor
performance.

10.3.2 Parameter Estimation 

The basic approach is to first estimate the model parameters, $\Theta$, (off-line) from an
independent set of deflection measurements, and then, incorporate them into the BSP

FIGURE 10.18 Physics-based approach to microcantilever “smart sensor” design: physics
evolution model, parameter estimation (off-line) data, Bayesian processor and data
enhancement.


to enhance the experimental (proof-of-concept) data. That is, we extract the critical
adsorption, desorption rate constants and maximum concentration parameters for each
channel as: $\Theta(\ell) = [k_a(\ell) \; k_d(\ell) \; \Gamma_{max}(\ell)]$. The parameter estimator employed was a
nonlinear least-squares solution using the Nelder-Mead polytope search algorithm
[43]. This algorithm is based on minimizing

$$\min_{\Theta} J(\Theta) = \sum_{t=1}^{N_t} \varepsilon_\ell^2(t; \Theta) \quad \text{for } \varepsilon_\ell(t; \Theta) := y_\ell(t) - \hat{y}_\ell(t; \Theta) \qquad (10.46)$$

where the estimated cantilever measurement at the $\ell^{th}$-lever is given by

$$\hat{y}_\ell(t; \Theta) = \Delta z^c_\ell(t; \Theta) + \Delta z^T(t) = \beta_\ell\,\Gamma(t; \Theta)\,\Delta G(t) + \Delta z^T(t) \qquad (10.47)$$

We executed this estimator on raw experimental deflection data and estimated the 
parameters for each lever. The extracted parameters reasonably predicted the filtered 
cantilever response and the resulting error was uncorrelated as discussed below. 
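The least-squares fit of Eq. 10.46 can be illustrated with a small polytope search. The sketch below is a simplified Nelder-Mead variant written from scratch (reflection, expansion, inside contraction and shrink); it is not the routine of [43]. The response model is the standard Langmuir step response with unit scale, and the "measurements" are noise-free synthetic data generated with hypothetical rate constants, so all names and values are assumptions.

```python
import math

def nelder_mead(f, x0, step=0.5, max_iter=500, tol=1e-12):
    """Minimal Nelder-Mead polytope search over a list of parameters."""
    n = len(x0)
    simplex = [list(x0)]
    for i in range(n):
        p = list(x0)
        p[i] += step
        simplex.append(p)
    for _ in range(max_iter):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        if abs(f(worst) - f(best)) < tol:
            break
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]
        reflect = [centroid[j] + (centroid[j] - worst[j]) for j in range(n)]
        if f(reflect) < f(best):
            expand = [centroid[j] + 2.0 * (centroid[j] - worst[j]) for j in range(n)]
            simplex[-1] = expand if f(expand) < f(reflect) else reflect
        elif f(reflect) < f(simplex[-2]):
            simplex[-1] = reflect
        else:
            contract = [centroid[j] + 0.5 * (worst[j] - centroid[j]) for j in range(n)]
            if f(contract) < f(worst):
                simplex[-1] = contract
            else:  # shrink every vertex toward the best one
                simplex = [best] + [[best[j] + 0.5 * (p[j] - best[j]) for j in range(n)]
                                    for p in simplex[1:]]
    simplex.sort(key=f)
    return simplex[0]

def model(t, k_a, k_d, c=1.0):
    """Assumed response: Langmuir step response for a unit concentration step at t = 0."""
    rate = k_a * c + k_d
    return (k_a * c / rate) * (1.0 - math.exp(-rate * t))

times = [0.1 * k for k in range(41)]
data = [model(t, 2.0, 0.5) for t in times]   # synthetic data, true (k_a, k_d) = (2.0, 0.5)

def cost(theta):
    """Sum of squared residuals J(theta), as in Eq. 10.46, for a single lever."""
    k_a, k_d = theta
    if k_a <= 0.0 or k_d <= 0.0:
        return 1e12                          # crude penalty keeping the rates positive
    return sum((y - model(t, k_a, k_d)) ** 2 for t, y in zip(times, data))

theta_hat = nelder_mead(cost, [1.0, 1.0])
print("estimated [k_a, k_d] =", theta_hat)
```

With a single concentration level only two parameters are identifiable from the step response, which is why the scale is fixed here; in practice a library optimizer would normally replace the hand-written search.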

10.3.3 Bayesian Processor Design 

Next, using these estimated parameters, we developed the BSP for the multichannel
deflection data. The GM model was used to synthesize the multichannel data and provide
known truth data to “tune” or adjust the processor noise covariance matrices that 
provide the “knobs” for BSP design (see Fig. 10.18 and [17] for details). Since the 
FIGURE 10.19 Bayesian processing for the “average” microcantilever sensor array
signal: (a) Raw deflection data, (b) Raw temperature data, (c) Parameter estimation,
(d) Bayesian estimation (enhancement).


simulation model and that used in the processor are identical, optimal (minimum 
error variance) performance is achieved (zero-mean, uncorrelated errors) providing 
the starting point for application to the experimental data. The final (simplified) BSP 
algorithm (assuming the parameters, $\Theta$, have been estimated) is shown below. Here
we see that the algorithm has a classical predictor-corrector form where the processor
predicts the free energy (conditional mean), filters the measurement, estimates the
innovation and corrects the final (filtered) free energy estimate.

$$
\begin{aligned}
\Delta\hat{G}(t|t-1) &= \Delta\hat{G}(t-1|t-1) && \text{[Free Energy Prediction]} \\
\hat{y}_\ell(t|t-1) &= \beta_\ell\,\Gamma(t; \hat{\Theta})\,\Delta\hat{G}(t|t-1) + \Delta z^T(t) && \text{[Deflection Prediction]} \\
\varepsilon_\ell(t) &= y_\ell(t) - \hat{y}_\ell(t|t-1) && \text{[Innovation or Residual]} \\
\Delta\hat{G}(t|t) &= \Delta\hat{G}(t|t-1) + \mathbf{k}'(t)\,\varepsilon(t) && \text{[Free Energy Correction]}
\end{aligned}
\qquad (10.48)
$$

where $\mathbf{k}$ is the corresponding weight or gain and $\hat{\Theta}$ is the output estimate from the
parameter estimator.
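For a single lever with the coverage $\Gamma(t; \hat{\Theta})$ computed beforehand, the recursion of Eq. 10.48 reduces to a scalar Kalman filter with a time-varying measurement gain. The following is a minimal sketch under that assumption; the synthetic data, the free energy value, the thermal drift and the noise levels are all hypothetical and serve only to exercise the recursion.

```python
import math
import random

def bsp_single_lever(y, gain, dz_thermal, r_ww, r_vv, g0=0.0, p0=1.0):
    """Predictor-corrector recursion for one lever.

    gain[t] plays the role of beta * Gamma(t; Theta_hat) in Eq. 10.48.
    """
    g_hat, p = g0, p0
    g_filtered, innovations = [], []
    for y_t, c_t, dz_t in zip(y, gain, dz_thermal):
        # Prediction: random-walk state, so the mean carries over and the variance grows.
        g_pred = g_hat
        p_pred = p + r_ww
        # Deflection prediction and innovation.
        y_pred = c_t * g_pred + dz_t
        e_t = y_t - y_pred
        # Gain and correction.
        r_ee = c_t * p_pred * c_t + r_vv
        k_t = p_pred * c_t / r_ee
        g_hat = g_pred + k_t * e_t
        p = (1.0 - k_t * c_t) * p_pred
        g_filtered.append(g_hat)
        innovations.append(e_t)
    return g_filtered, innovations

# Synthetic Gauss-Markov data with a constant (hypothetical) free energy.
rng = random.Random(0)
dt, n = 0.01, 400
true_dg = -5.0
gain = [0.8 * (1.0 - math.exp(-2.5 * k * dt)) for k in range(n)]
dz_thermal = [0.01 * k * dt for k in range(n)]
y = [c * true_dg + dz + rng.gauss(0.0, 0.05) for c, dz in zip(gain, dz_thermal)]

g_filtered, innovations = bsp_single_lever(
    y, gain, dz_thermal, r_ww=1e-6, r_vv=0.05 ** 2, p0=100.0)
print("final free energy estimate:", g_filtered[-1])
```

The noise variances `r_ww` and `r_vv` are the tuning "knobs" referred to in the text: they are adjusted on synthesized data until the innovations are zero-mean and white.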
FIGURE 10.20 Bayesian processing of microcantilever experimental (proof-of-concept)
data: Raw data, enhanced (BP) deflection measurement and residual results for each
lever.


10.3.4 Results 

First, we take the filtered measurement signals, average them to a single measurement, 
fit the parameters using an off-line optimization technique [43] and use the parameters 
in the Bayesian processor. The results are shown in Fig. 10.19 where we see the 
raw deflection and temperature measurements in a and b. The parameter estimator 
results are shown in Fig. 10.19c along with the state (signal) estimator in d. The 
parametric fit is quite reasonable as is the signal enhancement. However, it is clear
from the innovations that the optimal processor is not Gaussian, since they are not
zero-mean or white. Next, we used the BSP with the free energy as our piecewise
constant parameter (state) and the nonlinear cantilever array model with 6 elements
along with a filtered estimate of the temperature profile in the processor, $\Delta z^T(t)$.
By tuning the measurement noise covariance parameters ($R_{vv}$) we demonstrate that
the BSP is capable of tracking the noisy cantilever deflection data reasonably well; 
however, the performance is again suboptimal, since the innovations (shown in each 
figure), although quite small, are not uncorrelated. The results are shown in Fig. 10.20 
where we see the raw measured cantilever data, Bayesian processor estimates and 
the corresponding residual errors or innovations. The results are quite reasonable 
except for the systematic bias error in the processor output at each lever. The bias
is created by our lack of knowledge of the initial concentration input and can easily
be compensated (gain) at each lever as well. The dynamics appear to be captured 
by the model especially in cantilever 5. From the figure we note that the dynamics 
of the individual levers (on-set and off-sets) are quite close to the expected. This 
design demonstrates that even complex physical systems can be incorporated into 
physics-based processors enabling the development of a “smart” sensor. 


10.4 BAYESIAN PROCESSING FOR THE DETECTION 
OF RADIOACTIVE SOURCES 

With the increase in terrorist activities throughout the world, the need to develop 
techniques capable of detecting radioactive sources in a timely manner is a critical 
requirement. The development of Bayesian processors for the detection of contraband 
stems from the fact that the posterior distribution is clearly multimodal eliminating 
the usual Gaussian-based processors. The development of a sequential bootstrap
processor for this problem is discussed, and we show how it is capable of providing an
enhanced signal for detection.

10.4.1 Background 

Radionuclide source detection is a critical technology to detect the transportation of 
illicit radiological materials by potential terrorists. Detection of these materials is 
particularly difficult due to the inherent low-count emissions produced. These
emissions result when sources are shielded to disguise their existence or, when being
transported, are in relative motion with respect to the sensors. This section addresses 
the first step in investigating the problem of enhancing radionuclide signals from 
noisy radiation measurements using a Bayesian approach. Some work has been 
accomplished on this problem [44-48]. Here we model the source radionuclides 
by decomposing them uniquely as a superposition (union) of monoenergetic sources. 
Each γ-ray emitted is then smeared and distorted as it is transported on a path to the
output of the detector for measurement and counting.

We start with the “physics-based” approach to solving this suite of problems and
then discuss the measurement system employed to detect γ-rays and show how the
monoenergetic approach leads to a compound Poisson driven Markov process [56]
which is amplified, shaped and digitized for further processing. The processor is
developed using state-space representations of the transition probability and associated
likelihood and we apply it to synthesized data to evaluate its performance.

10.4.2 Physics-Based Models 

Radiation detection is the unique characterization of a radionuclide based on its
electromagnetic emissions. It has been and continues to be an intense area of research






FIGURE 10.21 Gamma-ray evolution and measurement: radionuclide source (EMS), 
medium transport (physics), detector material interaction, detector temporal response 
(preamplification/pulse shaping) and A/D conversion with quantization noise. 


and development for well over 50 years [57-61]. It is well known that a particular
radionuclide can be uniquely characterized by two basic properties: its energy
emitted in the form of photons or gamma-rays (γ-rays) and its radioactive decay rate.
Knowledge of one or both of these parameters is a unique representation of a
radionuclide. Mathematically, we define the pair, $[\{\epsilon_i\}, \{\lambda_i\}]$, as the respective energy level
(MeV) and decay rate (probability of disintegration/nuclei/sec) of the $i^{th}$-component
of the elemental radionuclide. Although either of these parameters can be used to
uniquely characterize a radionuclide, only one is actually necessary, unless there
is uncertainty in extracting the parameter. Gamma-ray spectrometry is a
methodology utilized to estimate the energy (probability) distribution or spectrum by creating
a histogram of measured arrival data at various levels (counts vs. binned energy)
[58]. It essentially decomposes the test sample γ-ray emissions into energy bins
discarding the temporal information. The sharp lines are used to identify the
corresponding energy bin “detecting” the presence of a particular component of the
radionuclide. In the ideal case, the spectrum consists only of lines or spikes located
at the correct bins of each constituent energy, $\epsilon_i$, uniquely characterizing the test
radionuclide sample.

Gamma-ray interactions are subject to the usual physical interaction constraints of 
scattering and attenuation as well as uncertainties intrinsic to the detection process. 
Energy detectors are designed to estimate the γ-ray energy from the measured
electron current. A typical detector is plagued with a variety of extraneous measurement
uncertainties that create inaccuracy and spreading of the measured current impulse
(and therefore γ-ray energy). The evolution of a γ-ray as it is transported through the
medium and interacts with materials, shield and the detector is shown in Fig. 10.21.
It is important to realize that in the diagram, the source radionuclide is represented
by its constituents in terms of monoenergetic (single energy level) components and
arrival times as $\xi(\epsilon_i, \tau_i)$. Since this representation of the source radionuclide contains
the constituent energy levels and timing, then all of the information is completely
captured by the sets, $[\{\epsilon_i\}, \{\tau_i\}]$, $i = 1, \ldots, N_\epsilon$. The arrivals can be used to extract
the corresponding set of decay constants, $\{\lambda_i\}$, which are reciprocally related (1/mean
rate). Thus, from the detector measurement of arrivals, or equivalently the so-called
FIGURE 10.22 Monoenergetic source decomposition: individual source constituent EMS
from ideal composite (superposition).


event mode sequence (EMS), a particular radionuclide can be uniquely characterized.
The constituent energy levels (spikes), $\{\epsilon_i\}$, and arrival times, $\{\tau_i\}$, extracted from
the EMS are depicted in Fig. 10.22, where we show the union (superposition) of
each of the individual constituent monoenergetic sequences composing the complete
radionuclide EMS. Note that there is no overlapping of arrivals, since an overlap is a
highly improbable event.

So we see that the signal processing model developed from the transport of the
γ-ray as it travels to the detector is measured and evolves as a distorted EMS.
First, we develop a model of the event mode sequence in terms of its
monoenergetic decomposition. Define $\xi(t; \epsilon_i, \tau_i, \lambda_i)$ as the component EMS sequence of
the $i^{th}$-monoenergetic source at time $t$ of energy level (amplitude), $\epsilon_i$, and arrival
time, $\tau_i$, with decay rate, $\lambda_i$, as a single impulse, that is, $\xi(t; \epsilon_i, \tau_i, \lambda_i) = \epsilon_i\,\delta(t - \tau_i)$
and rate $\lambda_i$. Thus, we note that the ideal EMS is composed of sets of
energy-time pairs, $\{\epsilon_i, \tau_i\}$. In order to define the entire emission sequence over a specified
time interval, $[t_0, T)$, we introduce the set notation, $\underline{\tau}_i := \{\tau_i(1) \cdots \tau_i(N_\epsilon(i))\}$, at
the $n^{th}$-arrival with $N_\epsilon(i)$ the total number of counts for the $i^{th}$-source in the
interval. Therefore, $\xi(t; \epsilon_i, \underline{\tau}_i, \lambda_i)$ results in an unequally-spaced impulse train given by
(see Fig. 10.22)

$$
\xi(t; \epsilon_i, \underline{\tau}_i, \lambda_i) = \sum_{n=1}^{N_\epsilon(i)} \xi(t; \epsilon_i, \tau_i(n), \lambda_i) = \sum_{n=1}^{N_\epsilon(i)} \epsilon_i\,\delta(t - \tau_i(n)) \qquad (10.49)
$$

The interarrival time is defined by $\Delta\tau_i(n) = \tau_i(n) - \tau_i(n-1)$ for $\Delta\tau_i(0) = t_0$
with the corresponding set definition (above) of $\underline{\Delta\tau}_i$ for $n = 1, \ldots, N_\epsilon(i) - 1$. Next we
extend the EMS model from a single source representation to incorporate a set of
$N_\epsilon$-monoenergetic sources.
Suppose we have a radionuclide source whose EMS is decomposed into its
$N_\epsilon$-monoenergetic source components, $\xi(t; \epsilon, \tau, \lambda)$. From the composition of the EMS we
know that

$$
\xi(t; \epsilon, \tau, \lambda) = \sum_{i=1}^{N_\epsilon} \sum_{n=1}^{N_\epsilon(i)} \xi(t; \epsilon_i, \tau_i(n), \lambda_i) = \sum_{i=1}^{N_\epsilon} \sum_{n=1}^{N_\epsilon(i)} \epsilon_i\,\delta(t - \tau_i(n)) \qquad (10.50)
$$

Clearly, since the EMS is the superposition of Poisson processes, then it is
also a composite Poisson process [56] with parameters: $\lambda = \sum_{i=1}^{N_\epsilon} \lambda_i$,
$\epsilon = \sum_{i=1}^{N_\epsilon} \frac{\lambda_i}{\lambda}\,\epsilon_i$ and $N = \sum_{i=1}^{N_\epsilon} N_\epsilon(i)$ for $\lambda$ the total decay rate, $\epsilon$ the
associated energy levels and $N$ the total counts in the interval, $[t_0, T)$. Note that the
composite decay rate is the superposition of all of the individual component rates. This
follows directly from the fact that the superposition of independent Poisson processes
is again a Poisson process, so that the interarrival times of the composite sequence
remain exponentially distributed. We note that the (composite) EMS of the radionuclide
directly contains information about $\lambda$, but not about its individual components, unless
we can extract the monoenergetic representation (Eq. 10.50) from the measured data.

Statistically, the EMS can be characterized by the following properties: 

• non-uniform arrival time samples, $\tau_i(n)$

• monoenergetic source components, $\xi(t; \epsilon_i, \tau_i(n), \lambda_i)$, having their own unique
decay rate, $\lambda_i$

• unique energy level, $\epsilon_i$

• gamma distributed arrival times, $\tau_i(n)$, $\Gamma(\lambda_i, \tau_i)$

• Poisson distributed counts, $N_\epsilon(i)$, $\mathcal{P}(N_\epsilon(n) = m)$

• exponentially distributed interarrival times, $\Delta\tau_i(n)$, $\mathcal{E}(\lambda_i \Delta\tau_i(n))$

• composite decay rate, $\lambda$
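The statistical properties listed above can be illustrated by synthesizing an ideal EMS. The sketch below draws exponentially distributed interarrival times for each monoenergetic source and superposes the resulting sequences; the energies, rates and observation interval are illustrative assumptions, as are the function names.

```python
import random

def monoenergetic_ems(energy, rate, t0, t_end, rng):
    """Arrival (time, energy) pairs of one source: exponential interarrival times."""
    events, t = [], t0
    while True:
        t += rng.expovariate(rate)
        if t >= t_end:
            return events
        events.append((t, energy))

def composite_ems(sources, t0, t_end, rng):
    """Superpose (merge and time-sort) the monoenergetic sequences."""
    events = []
    for energy, rate in sources:
        events.extend(monoenergetic_ems(energy, rate, t0, t_end, rng))
    return sorted(events)

# Illustrative sources: (energy in MeV, decay rate in counts/s).
sources = [(0.662, 20.0), (1.173, 10.0), (1.332, 10.0)]
rng = random.Random(0)
ems = composite_ems(sources, 0.0, 50.0, rng)

times = [t for t, _ in ems]
gaps = [b - a for a, b in zip(times, times[1:])]
print("empirical composite rate:", len(ems) / 50.0)      # compare with 20 + 10 + 10
print("mean interarrival time: ", sum(gaps) / len(gaps))  # compare with 1 / 40
```

The empirical composite rate should be close to the sum of the component rates, and the mean interarrival time close to its reciprocal, in line with the composite decay rate property.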

Next we consider the measurement of the EMS along with its inherent 
uncertainties. 


10.4.3 Gamma-Ray Detector Measurements 

Using the mathematical description of the EMS in terms of its monoenergetic source 
decomposition model discussed previously, we show how this ideal representation 
must be modified because of the distortion and smearing effects that occur as the
γ-rays propagate according to the transport physics of the radiation process. Typically,
these are quantified in terms of γ-ray spectral properties of energy “peak width”
and “peak amplitude”. The uncertainties evolve from three factors inherent in the 
material and instrumentation: inherent statistical spread in the number of charge 
carriers, variations in the charge collection efficiency and electronic noise [58]. In 
general, the energy resolution is defined in terms of a Gaussian random variable. 
Next we consider uncertainties created in the associated pulse processing
system that consists of an amplifier and pulse shaping circuits. Here we concentrate
on the amplitude output of the pulse shaper, since it not only carries the quantified
γ-ray energy information, but is also used for the detector timing circuits
(gating pulses, logic pulses, etc.). The shaped pulse is converted to a logic pulse in
order to extract the energy amplitude and precise timing information (arrival times,
interarrival times, etc.). We consider the pulse shaper circuitry capable of taking the
“raw” detector charge pulse, amplifying and shaping it to create a Gaussian pulse
shape [58]. Once the Gaussian pulse amplitude, which is proportional to the
original γ-ray energy (after scaling), is digitized or quantized by the analog-to-digital
converter (ADC), the critical EMS parameters, $[\{\epsilon_i\}, \{\tau_i\}, \{\lambda_i\}]$, energy level, arrival
time and decay rate can be extracted for further analysis and processing. From this
data all other information can be inferred about the identity and quantity of the test
radionuclide.

Next we define a signal processing model that captures the major
characteristics of a solid state detector in order to formulate our Bayesian approach to the
radiation detection problem. Consider the diagram again of the overall detector
system shown in Fig. 10.21. Here we see how the γ-ray is transported through the
medium (scattering and attenuation) to the detector. Each photon is deposited in the
detector material, charge is collected and a charging current created. This current
passes into measurement electronics that are also contaminated with random noise
followed by the quantization to produce the noisy output measurement. Thus, from
the $i^{th}$-monoenergetic component we have

$$
\begin{aligned}
p_{m_i}(t) &= \xi(t; \epsilon_i, \underline{\tau}_i, \lambda_i) * r(t) + w_{\tau_i}(t) \\
&= \sum_{n=1}^{N_\epsilon(i)} \epsilon_i\, r(t - \tau_i(n)) + w_{\tau_i}(t)
\end{aligned}
\qquad (10.51)
$$

where $r(t)$ is a rectangular window of unit amplitude defined within
$\tau_i(n) < t < \tau_i(n-1)$. The uncertain (random) amplitude is Gaussian, $\epsilon_i \sim \mathcal{N}(\epsilon_i, \sigma^2_{\epsilon_i})$,
with inherent uncertainty representing the material charge collection process time
“jitter” by the additive zero-mean, Gaussian noise, $w_{\tau_i} \sim \mathcal{N}(0, \sigma^2_{w_\tau})$, and $\tau(n) \to \tau_i(n)$;
$n = 1, \ldots, N_\epsilon(i)$. Therefore, the material output pulse train for the $i^{th}$-source is given by
$s(t) = H_S(t) * p_{m_i}(t) + v(t)$. Extending the model to incorporate all of the $N_\epsilon$-sources
composing the radionuclide leads to the superposition of all of the monoenergetic
pulse trains, that is, $p_m(t) = \sum_{i=1}^{N_\epsilon} p_{m_i}(t)$. The uncertain material pulse, $p_m(t)$, is
then provided as input to the pulse shaping circuitry. Here the preamplifier and pulse
shaper are characterized by a Gaussian filter with impulse response, $H_S(t)$, with output
given by

$$s(t) = H_S(t) * p_m(t) + v(t) \qquad (10.52)$$
where the uncertainty created by instrumentation noise is modeled through the
additive zero-mean, Gaussian noise source, $v \sim \mathcal{N}(0, \sigma_v^2)$. The shaped pulse is then
quantized ($t_k \to t$) and digitally processed to extract the energy levels and timing
information for further processing. Due to quantization limitations the ADC
inherently contaminates the measured pulse with zero-mean, Gaussian quantization noise,
$v_q(t_k)$, while there exists background radiation noise, $b(t_k)$, that must also be taken into
account. At this point, we could also develop a signal processing model of the
background, but we choose simplicity. We simply model it as an additive disturbance
at the output of the quantizer given by $b(t_k)$, giving us the final expression at the output
of the quantizer as

$$z(t_k) = s(t_k) + b(t_k) + v_q(t_k) \qquad (10.53)$$

with $v_q \sim \mathcal{N}(0, \sigma_q^2)$.
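The measurement chain of Eqs. 10.51 to 10.53 can be sketched on a uniform sampling grid. This is a simplified illustration under several assumptions: each arrival is shaped directly into a Gaussian pulse whose peak equals its energy, the timing jitter and the background term $b(t_k)$ are omitted, and the ADC is represented by a uniform rounding quantizer instead of an additive Gaussian noise term. The function name, pulse width and step size are hypothetical.

```python
import math
import random

def shaped_measurement(events, t_end, dt, sigma_pulse, sigma_v, lsb, rng):
    """Sample the quantized pulse-shaper output z(t_k) on a uniform grid.

    events: list of (arrival time, energy) pairs
    lsb:    quantizer step size standing in for the ADC resolution
    """
    z = []
    for k in range(int(round(t_end / dt))):
        t_k = k * dt
        # Pulse shaper: each arrival contributes a Gaussian pulse scaled by its energy.
        s = sum(e * math.exp(-0.5 * ((t_k - tau) / sigma_pulse) ** 2)
                for tau, e in events)
        s += rng.gauss(0.0, sigma_v)        # instrumentation noise v(t)
        z.append(lsb * round(s / lsb))      # uniform quantizer
    return z

# Two well-separated hypothetical arrivals (time in s, energy in MeV).
events = [(1.0, 0.662), (3.0, 1.332)]
rng = random.Random(0)
z = shaped_measurement(events, t_end=4.0, dt=0.01, sigma_pulse=0.05,
                       sigma_v=0.01, lsb=0.01, rng=rng)
peak = max(range(len(z)), key=lambda k: z[k])
print("largest pulse near t =", peak * 0.01, "s with amplitude", z[peak])
```

The peak amplitudes recovered from such a sequence approximate the energy levels, and the peak locations approximate the arrival times, which is the information the processor subsequently enhances.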

So we see that the entire EMS can be captured in a signal processing model with 
the key being the monoenergetic source decomposition representation of radiation 
transport. Next we start with this model and convert it to state-space Markovian form 
directly for Bayesian processing. 

In our problem, the EMS is the noisy input sequence characterized by both input 
and noise processes, that is, £ and w T —»• w. The states are part of the preamplifier and 
Gaussian pulse shaping system and the output is the quantized measurement, that is, 
z(t k ) —> y(t). To be more specific, we use £(f; e,-, t/. A.;), the i th -monoenergetic source 
including both amplitude and timing uncertainties as a Poisson input to our Markovian 
model above along with the matrices, A, B, C, specifying the pulse shaping circuit 
parameters transformed to state-space form. 

To see this consider the state-space representation for a single monoenergetic 
source is given by the following set of relations: 

Xj(t) = AiXj(t ) + b,•£(*; €(, r ( -, A.,) + w,uj T ,(f) [Source] 
y(t) = c fXi(t) + v(t) [Pulse Shaper] 

z(t k ) = y(t k ) + Vq(t k ); i=l,...,N e [ADC] (10.54) 

Expanding this model over $i$ to incorporate the $N_\epsilon$ monoenergetic source components
gives the extended state vector, $x(t) = [x_1(t) \mid x_2(t) \mid \cdots \mid x_{N_\epsilon}(t)]'$, where each component
state is dimensioned $N_x$ and therefore $x \in \mathcal{R}^{N_x N_\epsilon \times 1}$. Thus, the overall radiation
detection state-space model for $N_\epsilon$ monoenergetic sources is given by: $A = \mathrm{diag}[A_i]$,
$B = \mathrm{diag}[B_i]$, $C = [c_1' \mid c_2' \mid \cdots \mid c_{N_\epsilon}']$.
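
The block structure of the composite model can be sketched in a few lines. This is only a minimal illustration: the two component matrices below are hypothetical placeholders (they are not taken from the text), and `block_diag` is a helper defined here, not a library routine.

```python
def block_diag(blocks):
    """Place square blocks along the diagonal of an otherwise zero matrix."""
    n = sum(len(blk) for blk in blocks)
    out = [[0.0] * n for _ in range(n)]
    offset = 0
    for blk in blocks:
        for r, row in enumerate(blk):
            for c, val in enumerate(row):
                out[offset + r][offset + c] = val
        offset += len(blk)
    return out


# Hypothetical component models with N_x = 2 states and N_eps = 2 sources.
A1 = [[-1.0, 0.0], [1.0, -2.0]]
A2 = [[-0.5, 0.0], [1.0, -1.5]]
c1 = [0.0, 1.0]
c2 = [0.0, 1.0]

A = block_diag([A1, A2])  # A = diag[A_i], dimension (N_x N_eps) x (N_x N_eps)
C = c1 + c2               # C = [c_1' | c_2'], a single row over the stacked state
```

Because the components only couple through the summed measurement, adding another energy line amounts to appending one more block and one more row segment.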

It is interesting to note some of the major properties of this model. The first
feature to note is that the monoenergetic decomposition of the radionuclide source is
incorporated directly into the model structure. For instance, if we are searching for a
particular radionuclide and we know its major energy lines that uniquely describe its
spectrum, we can choose the appropriate value of $N_\epsilon$ and specify its corresponding
mean energy levels and decay rates directly; this is the physics-based approach.
We also note that the corresponding noise and statistics are easily captured by this
structure as well. This formulation is a continuous-discrete or simply “sampled-data”
model, since the ADC is used in the detection scheme.


10.4.4 Bayesian Physics-Based Processor 

In this section we discuss the development of a Bayesian processor for the problem of
enhancing a noisy EMS measurement with all of the information required “known”
a priori. We demonstrate how a radiation detector can be modeled (simply) from a
physics/statistical signal processing perspective, develop the mathematical representations
and incorporate them into a Bayesian framework to enhance the constituent
monoenergetic representation. We then demonstrate the Bayesian framework with an
illustrative simulation.

A simple radiation transport synthesizer was developed for signal analysis purposes
[62]. It consists of specifying the radionuclide in terms of its EMS and corresponding
monoenergetic source decomposition, then transporting this sequence through the
medium (shield) along with its inherent scattering to the detector. At the detector
the “surviving” or escaping γ-ray photons are transported through the detector
material (semiconductor), again being absorbed and scattered, with the final surviving
photons providing the current pulse input to the shaping circuitry as shown in
Fig. 10.21. After initializing the radionuclide and its corresponding monoenergetic
source decomposition, the simulator transports the “ideal” EMS through the shield that
incorporates both absorption (attenuation) and scattering (Compton) properties using
the prescribed shield parameters. The output of this step is specified by the percentage
of the photons escaping the shield and those captured or absorbed by the material and
converted to thermal energy. The surviving photons escaping are then transported to
the detector material where they undergo further absorption and scattering, with the
survivors converted to charge (electrons) provided as the input to the detector shaping
circuitry.

To illustrate the Bayesian approach using physics-based signal processing models,
we choose a single monoenergetic source sequence to represent a radionuclide
with parameters $\{\epsilon_o, \lambda_o, N_\epsilon(o)\}$ and generate the distortion and Gaussian smearing
to synthesize the noisy detector output as illustrated previously in Fig. 10.21. Next
we investigate the development of a sequential Bayesian processor for the following
problem, which can be stated formally as:

GIVEN a set of noisy γ-ray detector measurements, $\{z(t_k)\}$, and a set of a priori parameters
$\{\epsilon_o, \lambda_o, N_\epsilon(o)\}$ or equivalently its state-space representation, $\Sigma_o = \{A_o, B_o, C_o\}$,
along with a known (generated) EMS, $\{\xi_o(t)\}$, FIND the best estimate of the underlying
radionuclide EMS, $\{\hat{y}(t)\}$.

For our problem we assume we have a “good” synthetic model of the EMS and
we construct the ideal physics-based processor with known parameters $\{\epsilon_o, \tau_o, N_\epsilon(o)\}$
or equivalently known (generated by model) EMS. Note that we use the simplified
notation, $\xi_o(t) \to \xi(t; \tau_o, \lambda_o)$. Therefore, the state-space representation is
given by


$$\dot{x}_o(t) = A_o x_o(t) + b_o\, \xi_o(t) + w_{\tau_o}(t) \qquad \text{[Process]}$$
$$y(t) = c_o' x_o(t) + v(t) \qquad \text{[Measurement]}$$

$$z(t_k) = y(t_k) + v_q(t_k) \qquad \text{[ADC]} \qquad (10.55)$$

where $w_{\tau_o} \sim \mathcal{N}(0, R_{w_{\tau_o} w_{\tau_o}})$, $v \sim \mathcal{N}(0, R_{vv})$ and $v_q \sim \mathcal{N}(0, R_{v_q v_q})$. Under these linear
assumptions with additive Gaussian noise processes, the optimal processor is the
Kalman filter [17].

In order to develop the particle filter for this problem we require the transition
and likelihood distributions; therefore, under the modeling assumptions (Gaussian
noise, known input, parameters, etc.), we have that:

$$\mathcal{A}(x(t) \mid x(t-1)) \sim \mathcal{N}(A_o x(t-1) + b_o\, \xi_o(t),\; R_{w_{\tau_o} w_{\tau_o}}(t-1))$$

$$\mathcal{C}(y(t) \mid x(t)) \sim \mathcal{N}(c_o' x(t),\; R_{vv})$$

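Therefore, the bootstrap particle filter implementation for this problem for
$i = 1, \ldots, N_p$ is:

• Draw: $x_i(t) \sim \mathcal{A}(x(t) \mid x_i(t-1))$; $w_i \sim \Pr(w_i(t))$;

• Weight: $W_i(t) = \mathcal{C}(y(t) \mid x_i(t))$;

• Normalize: $\mathcal{W}_i(t) = W_i(t) / \sum_{i=1}^{N_p} W_i(t)$;

• Resample: $x_i(t) \Rightarrow \hat{x}_i(t)$;

• Posterior: $\hat{\Pr}(x(t) \mid Y_t) \approx \sum_i \mathcal{W}_i(t)\, \delta(x(t) - \hat{x}_i(t))$;

• Inferences: $\hat{x}(t \mid t)$, $\hat{X}_{MAP}(t)$.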
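
The steps listed above can be sketched for a scalar, discrete-time stand-in of the model. This is an illustration under stated assumptions rather than the processor used to produce the results in this section: the dynamics `a`, `b`, `c` and the pulse train `xis` are hypothetical, the noise variances are chosen to match the values used in the simulation of this section, and multinomial resampling is one of several possible resampling schemes.

```python
import math
import random


def bootstrap_pf(ys, xis, a, b, c, r_ww, r_vv, n_particles=500, seed=0):
    """Bootstrap PF for x(t) = a x(t-1) + b xi(t) + w(t), y(t) = c x(t) + v(t)."""
    rng = random.Random(seed)
    particles = [0.0] * n_particles
    x_hat = []
    for y, xi in zip(ys, xis):
        # Draw: sample each particle from the Gaussian transition prior.
        particles = [a * x + b * xi + rng.gauss(0.0, math.sqrt(r_ww))
                     for x in particles]
        # Weight: Gaussian likelihood of y(t) given each particle.
        weights = [math.exp(-0.5 * (y - c * x) ** 2 / r_vv) for x in particles]
        # Normalize (fall back to uniform weights if all underflow to zero).
        total = sum(weights)
        weights = ([w / total for w in weights] if total > 0.0
                   else [1.0 / n_particles] * n_particles)
        # Inference: conditional-mean estimate from the weighted particles.
        x_hat.append(sum(w * x for w, x in zip(weights, particles)))
        # Resample: multinomial draw with replacement.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return x_hat


# Synthetic data from the same (hypothetical) scalar model.
a, b, c, r_ww, r_vv = 0.9, 1.0, 1.0, 1e-6, 1e-2
sim = random.Random(1)
xis = [1.0 if t % 20 == 0 else 0.0 for t in range(100)]  # known input pulses
x, xs, ys = 0.0, [], []
for xi in xis:
    x = a * x + b * xi + sim.gauss(0.0, math.sqrt(r_ww))
    xs.append(x)
    ys.append(c * x + sim.gauss(0.0, math.sqrt(r_vv)))

x_hat = bootstrap_pf(ys, xis, a, b, c, r_ww, r_vv)
mse_raw = sum((y - s) ** 2 for y, s in zip(ys, xs)) / len(xs)
mse_pf = sum((e - s) ** 2 for e, s in zip(x_hat, xs)) / len(xs)
```

Comparing `mse_pf` with `mse_raw` gives a quick check that the estimate is closer to the simulated state than the raw measurement is.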

This completes the formulation and Bayesian processor realizations for both the
Kalman and particle filter designs. Next we synthesize a radiation detection problem
and apply the processors.

Suppose we have a nuclide represented by a single monoenergetic source of energy
level, $\epsilon_o = 3.086$ keV. Using the transport simulator with the following Gaussian noise
variances: $R_{ww} = 10^{-6}$ and $R_{vv} = 10^{-2}$, we generated a realization of the noisy EMS.
Next we construct the EMS signal enhancer; the results are shown in Fig. 10.23,
where we observe the raw synthesized data along with the enhanced
Bayesian processor estimates (both conditional mean and maximum a posteriori).
We see the enhanced EMS signal in (a) along with a zoomed version in (b) to observe
the actual enhancement. Note that the zero amplitude level noise has been minimized
as part of the enhancement process. The optimal, $\hat{X}_{opt}$ (Kalman filter), and particle
filter inferences for both conditional mean and maximum a posteriori are annotated
in Fig. 10.23; however, all of the realizations overlay one another, so they are hard to
differentiate.

FIGURE 10.23 Bayesian processor for radiation detection signal enhancement. (a) Entire
EMS enhancement with box annotating zoom area. (b) Zoomed EMS with raw and
enhanced processor outputs (PF predicted measurement).


10.4.5 Physics-Based Bayesian Deconvolution Processor 

In this section we consider extending the BP algorithm to solve the problem of estimating
an unknown input from data that have been “filtered.” This problem is called
deconvolution in the signal processing literature and occurs commonly in seismic and
speech processing [63] as well as transient problems [17, 64].

FIGURE 10.24 Model-based deconvolution problem: (a) Deterministic problem,
$y(t) = H(t) * u(t)$. (b) Stochastic problem (Gauss-Markov formulation),
$y(t) = H(t) * [u(t) + w(t)] + v(t)$.

In many measurement systems it is necessary to deconvolve or estimate the input
to an instrument given that the data are noisy. The basic deconvolution problem is
depicted in Fig. 10.24a for deterministic inputs $\{u(t)\}$ and outputs $\{y(t)\}$. The problem
can be simply stated as follows:

GIVEN the impulse response, $H(t)$, of a linear system and outputs $\{y(t)\}$, FIND the
unknown input $\{u(t)\}$ over some time interval.

In practice this problem is complicated by the fact that the data are noisy and 
impulse response models are uncertain. Therefore, a more pragmatic view of the 
problem would account for these uncertainties. The uncertainties lead us to define the 
stochastic deconvolution problem shown in Fig. 10.24b. This problem can be stated
as follows: 

GIVEN a model of the linear system, $H(t)$, and discrete noisy measurements $\{y(t)\}$,
FIND the minimum (error) variance estimate of the input sequence $\{u(t)\}$ over some
time interval.

The solution to this problem using the Bayesian processor involves developing a
model for the input and augmenting the state vector [17]. Suppose we utilize a discrete
Gauss-Markov model for the system and augment it with the following Gauss-Markov
model of the input signal:


$$u(t) = F(t-1)\, u(t-1) + n(t-1) \qquad (10.56)$$


where $n \sim \mathcal{N}(0, R_{nn}(t))$. The augmented Gauss-Markov model is given by
$X_u := [x' \mid u']'$ and $w_u := [w' \mid n']'$:

$$X_u(t) = A_u(t-1)\, X_u(t-1) + w_u(t-1) \qquad (10.57)$$

and

$$y(t) = C_u(t)\, X_u(t) + v(t) \qquad (10.58)$$

The matrices in the augmented model are given by

$$A_u(t-1) = \begin{bmatrix} A(t-1) & B(t-1) \\ 0 & F(t-1) \end{bmatrix}, \qquad
R_{w_u w_u}(t-1) = \begin{bmatrix} R_{ww}(t-1) & R_{wn}(t-1) \\ R_{nw}(t-1) & R_{nn}(t-1) \end{bmatrix}$$

and

$$C_u(t) = [C(t) \mid 0]$$

This model can be simplified by choosing $F = I$; that is, $u$ is piecewise constant.
This model becomes valid if the system is oversampled [64]. The BP for this problem
is the standard Kalman filter Bayesian algorithm with the augmented matrices given
by the equations:


State prediction: $\hat{X}_u(t \mid t-1) = A_u \hat{X}_u(t-1 \mid t-1)$

Innovation: $e(t) = y(t) - \hat{y}(t \mid t-1)$ where $\hat{y}(t \mid t-1) = C_u \hat{X}_u(t \mid t-1)$
State correction: $\hat{X}_u(t \mid t) = \hat{X}_u(t \mid t-1) + K(t) e(t)$


with $K(t)$, the Kalman gain calculated using the state error and innovations covariance
matrices where $\hat{X}_u(t \mid t) := E\{X_u(t) \mid Y_t\}$, that is, the conditional mean estimate of the
augmented state given all of the data up to time $t$. Note that this is an optimal
estimator under Gaussian assumptions (see Candy [17] for details).
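
As a rough illustration of the augmented-state recursion, the sketch below treats a scalar system with a scalar input, so that the augmented state is $[x, u]'$ with $F = 1$ and the covariance algebra can be written out by hand. All numerical values are hypothetical, the cross-covariance $R_{wn}$ is assumed to be zero, and the gain is computed from the predicted error covariance in the usual Kalman filter manner.

```python
import random


def kf_deconvolve(ys, a, b, c, r_ww, r_nn, r_vv):
    """Scalar system, augmented state X_u = [x, u]' with F = 1 (random-walk input)."""
    x, u = 0.0, 0.0                  # augmented state estimate
    p11, p12, p22 = 1.0, 0.0, 1.0    # entries of the symmetric error covariance
    u_hat = []
    for y in ys:
        # State prediction with A_u = [[a, b], [0, 1]].
        x, u = a * x + b * u, u
        q11 = a * a * p11 + 2.0 * a * b * p12 + b * b * p22 + r_ww
        q12 = a * p12 + b * p22
        q22 = p22 + r_nn
        # Innovation with C_u = [c, 0].
        e = y - c * x
        r_ee = c * c * q11 + r_vv
        # Kalman gain K = P C_u' / R_ee.
        k1, k2 = q11 * c / r_ee, q12 * c / r_ee
        # State and covariance correction.
        x, u = x + k1 * e, u + k2 * e
        p11 = (1.0 - k1 * c) * q11
        p12 = (1.0 - k1 * c) * q12
        p22 = q22 - k2 * c * q12
        u_hat.append(u)
    return u_hat


# Hypothetical test: a constant unknown input u = 2 drives a first-order system.
rng = random.Random(0)
a, b, c, r_ww, r_nn, r_vv = 0.8, 1.0, 1.0, 1e-4, 1e-3, 1e-4
x, ys = 0.0, []
for _ in range(200):
    x = a * x + b * 2.0 + rng.gauss(0.0, r_ww ** 0.5)
    ys.append(c * x + rng.gauss(0.0, r_vv ** 0.5))

u_hat = kf_deconvolve(ys, a, b, c, r_ww, r_nn, r_vv)
```

The second component of the corrected state is the deconvolved input estimate; `r_nn` controls how quickly that estimate is allowed to change between samples.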

One approach to estimating the unknown input sequence, u(t) is to use a Taylor- 
series representation [64] given by 


where a, = for w(f + AT) ~ JNa, )• 

For our problem [47, 48] we consider two cases: (i) the composite system of the
pre-amplifier and pulse shaping circuits; and (ii) the pre-amplifier subsystem. In case
(i) we assume all of the required information about the EMS is available at the output
of the pulse shaping circuitry and the quantizer (ADC) merely extracts the maximum
amplitude of the Gaussian shaped pulse and corresponding arrival times, $[\{\epsilon_i\}, \{\tau_i\}]$.
However, we also consider case (ii) where we measure the output of the pre-amplifier
(separately). Here the energy deposited by the γ-ray and subsequent charge curve
reveals more detailed information about the γ-ray physics (arrival times, multiple
arrivals, etc.). We digitize the actual pre-amplifier output generating a time series of the
pulse and then perform the deconvolution to extract an “enhanced” γ-ray pulse through
the recovered (deconvolved) charging curve leading to the enhanced EMS. Once the
EMS is recovered, the inherent photon information can be extracted (amplitudes and
arrivals) and counted or employed as the input to a parameter estimator capable of
providing improved energy (amplitude) estimates and corresponding arrival times
while minimizing the noise and uncertainty.

In this section we concentrate on the model and deconvolved EMS. We accomplish 
the deconvolution by performing a system identification [17] of both pre-amplifier and
pulse shaper to obtain transfer function estimates and then incorporate these estimates 
in the deconvolution algorithm. In this manner we will eventually be able to construct 
the final Bayesian sequential processor. 

Once the deconvolved EMS is available from the processor, a Bayesian detector
can be constructed to “decide” whether or not the threat radionuclide is present.
We assume the deconvolved and enhanced EMS is captured by $\xi(t; \epsilon, \tau, \lambda)$. Thus, the
binary detection problem is defined by testing the hypotheses

$$H_0: \; y(t) = v(t) \qquad \text{[Noise]}$$

$$H_1: \; y(t) = \xi(t; \epsilon, \tau, \lambda) + v(t) \qquad \text{[Signal + Noise]} \qquad (10.60)$$

with $v \sim \mathcal{N}(0, R_{vv})$. Thus, under the Neyman-Pearson criterion the optimal sequential
decision function is the log-likelihood ratio given by [65]

$$\Lambda_\xi(t) = \Lambda_\xi(t-1) + \ln \Pr(y(t) \mid H_1) - \ln \Pr(y(t) \mid H_0) \qquad (10.61)$$

where the distributions are specified by the inherent statistics associated with the EMS
that still must be determined. Our approach will be to develop particle filters capable of
estimating the appropriate posterior distributions. The sequential detector is therefore
given by

$$\Lambda_\xi(t) \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \mathcal{T}_\xi \qquad (10.62)$$

Ultimately, this scheme will be implemented to perform the radionuclide contraband 
detection. 
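
To make the sequential log-likelihood recursion concrete, the sketch below works through the simplest special case, in which the signal under $H_1$ is a known sequence and the noise is white Gaussian with known variance. This is an assumption made purely for illustration; the text notes that the actual distributions still have to be determined. The signal amplitude, noise variance and threshold are hypothetical.

```python
import random


def sequential_llr(ys, signal, r_vv):
    """Accumulate ln Pr(y|H1) - ln Pr(y|H0) for white Gaussian noise of variance r_vv."""
    llr, history = 0.0, []
    for y, s in zip(ys, signal):
        # The two Gaussian log-densities differ only in their exponents.
        llr += (y * y - (y - s) ** 2) / (2.0 * r_vv)
        history.append(llr)
    return history


rng = random.Random(0)
r_vv = 0.25
signal = [1.0] * 50                                   # hypothetical known signal
noise = [rng.gauss(0.0, r_vv ** 0.5) for _ in signal]

llr_h1 = sequential_llr([s + v for s, v in zip(signal, noise)], signal, r_vv)
llr_h0 = sequential_llr(noise, signal, r_vv)

threshold = 0.0                                       # hypothetical threshold
decide_signal_present = llr_h1[-1] > threshold
decide_noise_only = llr_h0[-1] <= threshold
```

In a fully sequential test the running value would be compared against thresholds at every sample and the test stopped as soon as one is crossed; only the final value is compared here to keep the sketch short.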

10.4.6 Results 

In this section we discuss the results of developing the models for both pre-amplifier 
and pulse shaper and applying them to perform the deconvolution operation [47], [48]. 
We injected a set of pulses into both subsystem components individually obtaining 
the required transfer functions and then developed the physics-based deconvolution 
processor as discussed in the previous section. 

The test of the algorithm is on a simulated radionuclide ($^{60}$Co) EMS with 1.17 and
1.33 MeV lines generating the random detector input sequence. Here we convolved the
simulator output (deposited energy) with the identified composite transfer function.
The results are shown in Fig. 10.25, where we see the transfer function validation run
in (a) and the actual deconvolution processor output in (b). The processor is capable
of extracting the EMS successfully and improving the overall γ-ray spectrum significantly,
as shown in Fig. 10.26. In (a) we see the “true” spectrum indicating two sharp
energy lines at the correct energies (1.17 and 1.33 MeV); in (b) the estimated (deconvolved)
spectrum has captured the lines with some uncertainty (spreading shown), but
its performance is quite reasonable and demonstrates the enhancement relative to
the measured spectrum of (c). Thus, the Bayesian deconvolver works quite well
on the synthesized data set, and it appears that the processor can reliably extract the
input excitation using this physics-based approach.

FIGURE 10.25 Bayesian deconvolution processor design. (a) Response estimate from
system identification of composite (preamplifier and pulse shaper) system. (b) Deconvolution
processing using identified impulse response (transfer function) with synthesized
(known) EMS ($^{60}$Co: 1.17 and 1.33 MeV lines).

FIGURE 10.26 Deconvolution processor performance/enhancement ($^{60}$Co: 1.17 and
1.33 MeV lines). (a) Histogram of true EMS (synthesized). (b) Processed EMS histogram.
(c) Raw (synthesized) measured detector output histogram.
Appendix A


PROBABILITY AND
STATISTICS OVERVIEW


A.1 PROBABILITY THEORY

Defining a sample space (outcomes), $\Omega$, a field (events), $\mathcal{B}$, and a probability function
(on a class of events), $\Pr$, we can construct an experiment as the triple, $\{\Omega, \mathcal{B}, \Pr\}$.


Example A.1

Consider the experiment, $\{\Omega, \mathcal{B}, \Pr\}$, of tossing a fair coin; then we see that

Sample space: $\Omega = \{H, T\}$
Events:       $\mathcal{B} = \{\emptyset, \{H\}, \{T\}\}$
Probability:  $\Pr(H) = p$
              $\Pr(T) = 1 - p$ △△△

With the idea of a sample space, probability function, and experiment in mind,
we can now start to define the concept of a discrete random signal more precisely.
We define a discrete random variable as a real function whose value is determined
by the outcome of an experiment. It assigns a real number to each point of a sample
space $\Omega$, which consists of all the possible outcomes of the experiment. A random
variable $X$ and its realization $x$ are written as

$$X(\omega) = x \quad \text{for } \omega \in \Omega \qquad (A.1)$$


Consider the following example of a simple experiment.

Example A.2

We are asked to analyze the experiment of flipping a fair coin; the sample space
consists of a head or tail as possible outcomes, that is,

$$\Omega = \{\emptyset, H, T\} \Rightarrow X(\omega) = x, \qquad \omega = \{H, T\}$$

If we assign a 1 for a head and 0 for a tail, then the random variable $X$ performs the
mapping of

$$X(\omega = H) = x(H) = 1$$
$$X(\omega = T) = x(T) = 0$$

where $x(\cdot)$ is called the sample value or realization of the random variable $X$. △△△
A probability mass function is defined in terms of the random variable, that is,

$$P_X(x_i) = \Pr(X(\omega_i) = x_i) \qquad (A.2)$$

and the probability distribution function is defined by

$$F_X(x_i) = \Pr(X(\omega_i) \le x_i) \qquad (A.3)$$

These are related by

$$P_X(x) = \sum_i P_X(x_i)\, \delta(x - x_i) \qquad (A.4)$$

$$F_X(x) = \sum_i P_X(x_i)\, \mu(x - x_i) \qquad (A.5)$$

where $\delta$ and $\mu$ are the unit impulse and step functions, respectively.

It is easy to show that the distribution function is a monotonically increasing
function (see Papoulis [1] for details) satisfying the following properties,

$$\lim_{x_i \to -\infty} F_X(x_i) = 0$$

and

$$\lim_{x_i \to \infty} F_X(x_i) = 1$$

These properties can be used to show that the mass function satisfies

$$\sum_i P_X(x_i) = 1$$


Either the distribution or probability mass function completely describes the
properties of a random variable. Given either of these functions, we can calculate probabilities
that the random variable takes on values in any set of events on the real line.
To complete our coin tossing example, if we define the probability of a head occurring
as $p$, then we can calculate the distribution and mass functions as shown in the
following example.


Example A.3

Consider the coin tossing experiment and calculate the corresponding mass and
distribution functions. From the previous example, we have

Sample space:     $\Omega = \{H, T\}$
Events:           $\mathcal{B} = \{\emptyset, \{H\}, \{T\}\}$
Probability:      $P_X(x_1 = H) = p$
                  $P_X(x_2 = T) = 1 - p$
Random variable:  $X(\omega_1 = H) = x_1 = 1$
                  $X(\omega_2 = T) = x_2 = 0$
Distribution:     $F_X(x_i) = \begin{cases} 1 & x_i \ge 1 \\ 1 - p & 0 \le x_i < 1 \\ 0 & x_i < 0 \end{cases}$

The mass and distribution functions for this example are shown in Fig. A.1. Note that
the sum of the mass function values must be 1 and that the maximum value of the
distribution function is 1, satisfying the properties mentioned previously. △△△
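
The coin-tossing mass and distribution functions are easy to check numerically. In the short sketch below the function names are illustrative only, and the mapping head = 1, tail = 0 follows the example.

```python
def coin_pmf(p):
    """Mass function for X = 1 (head) with probability p and X = 0 (tail) otherwise."""
    return {1: p, 0: 1.0 - p}


def cdf(pmf, x):
    """Distribution function F_X(x) = Pr(X <= x), summed from the mass function."""
    return sum(prob for value, prob in pmf.items() if value <= x)


pmf = coin_pmf(0.3)
steps = [cdf(pmf, x) for x in (-0.5, 0.0, 0.5, 1.0, 2.0)]
total_mass = sum(pmf.values())
```

The list `steps` traces the staircase of the distribution function: zero below 0, $1 - p$ on $[0, 1)$, and 1 from 1 onward.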

If we extend the idea that a random variable is now a function of time as well,
then we can define a stochastic process as discussed in Chapter 2. More formally, a
random or stochastic process is a two-dimensional function of $t$ and $\omega$:


$$X(t, \omega), \quad \omega \in \Omega, \; t \in T \qquad (A.6)$$


where $T$ is a set of index parameters (continuous or discrete) and $\Omega$ is the sample
space.


FIGURE A.1 Probability mass and distribution functions for coin-tossing experiment.

We list some of the major theorems in probability theory and refer the reader to
more detailed texts [1, 2].

Univariate:   $\Pr(X) = P_X(x)$
Bivariate:    $\Pr(X, Y) = P_{XY}(x, y)$
Marginal:     $\Pr(X) = \sum_y P_{XY}(x, y)$
Independent:  $\Pr(X, Y) = P_X(x) \times P_Y(y)$
Conditional:  $\Pr(X \mid Y) = P_{XY}(x, y) / P_Y(y)$
Chain Rule:   $\Pr(X, Y, Z) = \Pr(X \mid Y, Z) \times \Pr(Y \mid Z) \times \Pr(Z)$


For a random variable, we can define basic statistics in terms of the probability
mass function. The expected value or mean of a random variable $X$ is given by

$$m_x = E\{X\}$$

and is considered the typical or representative value of a given set of data. For this
reason, the mean is called a measure of central tendency. The degree to which numerical
data tend to spread about the expected value is usually measured by the variance
or equivalently, auto-covariance given by

$$R_{xx} = E\{(X - m_x)^2\}$$

The basic statistical measures are called ensemble statistics because they are measured 
across the ensemble (i = 1,2,...) of data values, that is, the expectation is always 
assumed over an ensemble of realizations. We summarize these statistics in terms of 
their mass function as: 


Expected Value:           $m_x = E\{X\} = \sum_i x_i P_X(x_i)$
$N$th-moment:             $E\{X^n\} = \sum_i x_i^n P_X(x_i)$
$N$th-moment about mean:  $E\{(X - m_x)^n\} = \sum_i (x_i - m_x)^n P_X(x_i)$
Mean Squared ($N = 2$):   $E\{X^2\} = \sum_i x_i^2 P_X(x_i)$
Variance:                 $R_{xx} = E\{(X - m_x)^2\} = \sum_i (x_i - m_x)^2 P_X(x_i)$
Covariance:               $R_{xy} = E\{(X_i - m_x)(Y_j - m_y)\}$
Standard Deviation:       $\sigma_{xx} = \sqrt{R_{xx}}$
Conditional Mean:         $E\{X \mid Y\} = \sum_i x_i P(x_i \mid Y)$
Joint Conditional Mean:   $E\{X \mid Y, Z\} = \sum_i x_i P(x_i \mid Y, Z)$
Conditional Variance:     $R_{x|y} = E\{(X - E\{X \mid Y\})^2 \mid Y\}$


These basic statistics possess various properties that enable them to be useful for
analyzing operations on random variables; some of the more important¹ are:

Linearity:     $E\{ax + b\} = aE\{x\} + b = a m_x + b$
Independence:  $E\{xy\} = E\{x\}E\{y\}$
Variance:      $R_{xx}(ax + b) = a^2 R_{xx}$

Covariance:
Uncorrelated:  $E\{xy\} = E\{x\}E\{y\}$ $\{R_{xy} = 0\}$
Orthogonal:    $E\{xy\} = 0$

¹ Recall that independence states that the joint mass function can be factored, $\Pr(x, y) = \Pr(x) \times \Pr(y)$,
which leads to these properties.
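
The linearity and variance properties can be verified directly from a mass function. The following sketch uses the coin-tossing mass function with an arbitrary $p = 0.3$ and arbitrary constants $a = 2$, $b = 5$; the helper names are illustrative.

```python
def mean(pmf):
    """Expected value m_x = sum_i x_i P_X(x_i)."""
    return sum(x * p for x, p in pmf.items())


def variance(pmf):
    """Variance R_xx = sum_i (x_i - m_x)^2 P_X(x_i)."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())


def affine(pmf, a, b):
    """Mass function of aX + b (a must be nonzero so the values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}


coin = {1: 0.3, 0: 0.7}
a, b = 2.0, 5.0
scaled = affine(coin, a, b)

linearity_gap = mean(scaled) - (a * mean(coin) + b)
variance_gap = variance(scaled) - a ** 2 * variance(coin)
```

Both gaps are zero up to floating-point rounding, as the two properties predict.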

Note that the expected value operation implies that for stochastic processes these basic
statistics are calculated across the ensemble. For example, if we want to calculate the
mean of a process, that is,


$$m_x(t) = E\{X(t, \omega_i) = x_i(t)\}$$

we simply take the values of t = 0,1,... and calculate the mean for each value of 
time across (i = 1,2,...) the ensemble. Dealing with stochastic processes is similar 
to dealing with random variables except that we must account for the time indices 
(see Chapter 2 for more details). 

Next let us define some concepts about the probabilistic information contained in
a random variable. We define the (self) information contained in the occurrence of
the random variable $X(\omega_i) = x_i$ as

$$I(x_i) = -\log_b P_X(x_i) \qquad (A.7)$$

where $b$ is the base of the logarithm, which results in different units for information
measures (base 2 gives bits), and the entropy or average information of $X(\omega_i)$ as


$$H(x_i) = E\{I(x_i)\} = -\sum_i P_X(x_i) \log_b P_X(x_i) \qquad (A.8)$$
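
As a small numerical illustration, the entropy of a discrete mass function can be computed as follows. The sketch assumes the usual convention that entropy is the negative of the probability-weighted sum of log-probabilities, with zero-probability outcomes contributing nothing.

```python
import math


def entropy(pmf, base=2):
    """Entropy -sum_i P(x_i) log_b P(x_i); zero-probability terms are skipped."""
    return -sum(p * math.log(p, base) for p in pmf.values() if p > 0.0)


fair_coin = {1: 0.5, 0: 0.5}
biased_coin = {1: 0.9, 0: 0.1}

h_fair = entropy(fair_coin)       # in bits, since base = 2
h_biased = entropy(biased_coin)
```

The fair coin attains the largest entropy available to a two-outcome variable; any bias lowers it.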


Consider the case where there is more than one random variable. Then we define the
joint mass and distribution functions of an $N$-dimensional random variable as

$$P_X(x_1, \ldots, x_N), \quad F_X(x_1, \ldots, x_N)$$

All of the basic statistical definitions remain as before, except that we replace the scalar
with the joint functions. Clearly, if we think of a stochastic process as a sequence
of ordered random variables, then we are dealing with joint probability functions,
that is, a collection of time-indexed random variables. Suppose we have two random
variables, $x_1$ and $x_2$, and we know that the latter has already assumed a particular
value; then we can define the conditional probability mass function of $x_1$ given that
$X(\omega_2) = x_2$ has occurred by:

$$\Pr(x_1 \mid x_2) := P_X(X(\omega_1) \mid X(\omega_2) = x_2) \qquad (A.9)$$


and it can be shown from basic probabilistic axioms (see Papoulis [1]) that

$$\Pr(x_1 \mid x_2) = \frac{\Pr(x_1, x_2)}{\Pr(x_2)} \qquad (A.10)$$

Note that the joint mass function can also be written as

$$\Pr(x_1, x_2) = \Pr(x_2 \mid x_1) \Pr(x_1) \qquad (A.11)$$

Substituting this equation into Eq. A.10 gives Bayes' rule, that is,

$$\Pr(x_1 \mid x_2) = \Pr(x_2 \mid x_1)\, \frac{\Pr(x_1)}{\Pr(x_2)} \qquad (A.12)$$
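
A two-hypothesis numerical example may help fix the roles of the terms in Bayes' rule. The prior and likelihood values below are hypothetical and are chosen only to show how a small prior limits the posterior even when the likelihood is favorable.

```python
def bayes_posterior(prior, likelihood):
    """Pr(x1 | x2) for each x1, from Pr(x1) and Pr(x2 | x1) at the observed x2."""
    joint = {h: likelihood[h] * prior[h] for h in prior}   # Pr(x2 | x1) Pr(x1)
    evidence = sum(joint.values())                          # Pr(x2), by marginalization
    return {h: j / evidence for h, j in joint.items()}


prior = {"source": 0.01, "background": 0.99}        # hypothetical Pr(x1)
likelihood = {"source": 0.90, "background": 0.05}   # hypothetical Pr(x2 | x1)
posterior = bayes_posterior(prior, likelihood)
```

Here the denominator $\Pr(x_2)$ is obtained by summing the joint mass function over $x_1$, which is the marginal relation listed among the theorems earlier in this appendix.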

If we use the definition of joint mass function and substitute into the previous
definitions, then we can obtain the probabilistic chain rule [1-3],

$$\Pr(x_1, \ldots, x_N) = \Pr(x_1 \mid x_2, \ldots, x_N) \Pr(x_2 \mid x_3, \ldots, x_N) \cdots \Pr(x_{N-1} \mid x_N) \Pr(x_N) \qquad (A.13)$$

Along with these definitions follows the idea of conditional expectation, that is,

$$E\{x_i \mid x_j\} = \sum_i x_i \Pr(x_i \mid x_j) \qquad (A.14)$$


With the conditional expectation defined, we list some of its basic properties:


1. $E_x\{X \mid Y\} = E\{X\}$, if $X$ and $Y$ are independent

2. $E\{X\} = E_y\{E\{X \mid Y\}\}$

3. $E_x\{g(Y) X \mid Y\} = g(Y) E\{X \mid Y\}$

4. $E_{x,y}\{g(Y) X\} = E_y\{g(Y) E\{X \mid Y\}\}$

5. $E_x\{c \mid Y\} = c$

6. $E_x\{g(Y) \mid Y\} = g(Y)$

7. $E_{x,y}\{cX + dY \mid Z\} = c E\{X \mid Z\} + d E\{Y \mid Z\}$


The concepts of information and entropy can also be extended to the case of more
than one random variable. We define the mutual information between two random
variables, $x_i$ and $x_j$, as

$$I(x_i; x_j) = \log_b \frac{P_X(x_i \mid x_j)}{P_X(x_i)} \qquad (A.15)$$

and the average mutual information between $X(\omega_i)$ and $X(\omega_j)$ as

$$I(X_i; X_j) = E_{x_i x_j}\{I(x_i; x_j)\} = \sum_i \sum_j P_X(x_i, x_j)\, I(x_i; x_j) \qquad (A.16)$$

which leads to the definition of joint entropy as

$$H(X_i; X_j) = -\sum_i \sum_j P_X(x_i, x_j) \log_b P_X(x_i, x_j) \qquad (A.17)$$


This completes the section on probability theory. Next let us consider an important
multivariable distribution and its properties.

A.2 GAUSSIAN RANDOM VECTORS

In this section we consider the multivariable Gaussian distribution used heavily in this
text to characterize Gaussian random vectors, that is, $z \sim \mathcal{N}(m_z, R_{zz})$ where $z \in \mathcal{R}^{N_z \times 1}$
and defined by

$$\Pr(z) = (2\pi)^{-N_z/2}\, |R_{zz}|^{-1/2} \exp\left(-\tfrac{1}{2}(z - m_z)' R_{zz}^{-1} (z - m_z)\right) \qquad (A.18)$$

where the vector mean and covariance are defined by

$$m_z := E\{z\} \quad \text{and} \quad R_{zz} = \mathrm{Cov}(z) := E\{(z - m_z)(z - m_z)'\}$$

Certain properties of the Gaussian vectors are useful such as:

• Linear transformation. Linear transformations of Gaussian variables are
Gaussian; that is, if $z \sim \mathcal{N}(m_z, R_{zz})$ and $y = Az + b$, then

$$y \sim \mathcal{N}(A m_z + b,\; A R_{zz} A') \qquad (A.19)$$

• Uncorrelated Gaussian vectors. Uncorrelated Gaussian vectors are independent.

• Sums of Gaussian variables. Sums of independent Gaussian vectors yield Gaussian
distributed vectors with mean and variance equal to the sums of the
respective means and variances.

• Conditional Gaussian vectors. Conditional Gaussian vectors are Gaussian
distributed; that is, if $x$ and $y$ are jointly Gaussian, with

$$m_z = E\left\{\begin{bmatrix} x \\ y \end{bmatrix}\right\} = \begin{bmatrix} m_x \\ m_y \end{bmatrix} \quad \text{and} \quad R_{zz} = \mathrm{Cov}(z) = \begin{bmatrix} R_{xx} & R_{xy} \\ R_{yx} & R_{yy} \end{bmatrix}$$

then the conditional distribution of $x$ given $y$ is also Gaussian with conditional
mean and covariance given by

$$m_{x|y} = m_x + R_{xy} R_{yy}^{-1} (y - m_y)$$

$$R_{x|y} = R_{xx} - R_{xy} R_{yy}^{-1} R_{yx}$$

and the vectors $x - E\{x \mid y\}$ and $y$ are independent.

• Gaussian conditional means. Let $x$, $y$ and $z$ be jointly distributed Gaussian
random vectors and let $y$ and $z$ be independent, then

$$E\{x \mid y, z\} = E\{x \mid y\} + E\{x \mid z\} - m_x$$
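
For scalar $x$ and $y$ the conditional Gaussian relations reduce to ordinary arithmetic, which makes them easy to check by hand. The numbers in this sketch are arbitrary.

```python
def conditional_gaussian(m_x, m_y, r_xx, r_xy, r_yy, y):
    """Conditional mean and variance of x given y for jointly Gaussian scalars."""
    gain = r_xy / r_yy                  # R_xy R_yy^{-1}
    m_cond = m_x + gain * (y - m_y)     # conditional mean
    r_cond = r_xx - gain * r_xy         # R_xx - R_xy R_yy^{-1} R_yx
    return m_cond, r_cond


m_cond, r_cond = conditional_gaussian(m_x=1.0, m_y=2.0, r_xx=4.0,
                                      r_xy=1.5, r_yy=1.0, y=3.0)
```

Observing $y$ shifts the mean of $x$ in proportion to the cross-covariance and can only reduce its variance; with $R_{xy} = 0$ the observation carries no information about $x$ and both quantities stay at their prior values.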
A.3 UNCORRELATED TRANSFORMATION: GAUSSIAN RANDOM
VECTORS

Suppose we have a Gaussian random vector, $x \sim \mathcal{N}(m_x, R_{xx})$, and we would like to
transform it to a normalized Gaussian random vector with the mean, $m_x$, removed so
that $z \sim \mathcal{N}(0, I)$. Assume that the mean has been removed ($x \to x - m_x$); then there
exists a nonsingular transformation, $T$, such that $z = Tx$ and therefore

$$R_{zz} = \mathrm{Cov}(z) = E\{(Tx)(Tx)'\} = T R_{xx} T' = I \qquad (A.20)$$

Thus we must find a transformation that satisfies the relation

$$R_{zz} = I = T R_{xx} T' \qquad (A.21)$$

Since $R_{xx}$ is a positive semi-definite, symmetric matrix it can always be factored
into matrix square roots ($R_{xx} = UU' = R_{xx}^{1/2} R_{xx}^{T/2}$) using a Cholesky decomposition [4];
therefore, Eq. A.21 implies

$$R_{zz} = I = (TU)U'T' \qquad (A.22)$$

or simply that

$$T = U^{-1} = R_{xx}^{-1/2} \quad \text{(inverse matrix square root)} \qquad (A.23)$$

and therefore

$$R_{zz} = (R_{xx}^{-1/2} U)\, U' R_{xx}^{-T/2} = R_{xx}^{-1/2} R_{xx} R_{xx}^{-T/2} = I \qquad (A.24)$$


the desired result.
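
The whitening transformation can be checked numerically for a small covariance matrix. The sketch below is restricted to the 2 x 2 case so that the Cholesky factor and its inverse can be written in closed form. The covariance values are arbitrary, and the matrix is taken to be positive definite so that the factor is invertible.

```python
import math


def cholesky_2x2(r):
    """Lower-triangular U with R = U U' for a 2 x 2 positive definite matrix."""
    u11 = math.sqrt(r[0][0])
    u21 = r[1][0] / u11
    u22 = math.sqrt(r[1][1] - u21 * u21)
    return [[u11, 0.0], [u21, u22]]


def inverse_lower_2x2(u):
    """Closed-form inverse of a 2 x 2 lower-triangular matrix."""
    return [[1.0 / u[0][0], 0.0],
            [-u[1][0] / (u[0][0] * u[1][1]), 1.0 / u[1][1]]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


r_xx = [[4.0, 2.0], [2.0, 3.0]]
u = cholesky_2x2(r_xx)
t = inverse_lower_2x2(u)                        # T = U^{-1}
r_zz = matmul(matmul(t, r_xx), transpose(t))    # T R_xx T'
```

Up to rounding, `r_zz` is the identity matrix, which is the statement of Eq. A.24 for this example.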

This discussion completes the introductory concepts of probability and random
variables, which are extended to include stochastic processes in Chapter 2 and
throughout the text.
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Cantilever array, 398 
Cantilever physics model, 400 
Cartesian coordinates, 171, 224 
Cartesian tracking model, 195, 234 
CCD camera, 372 
Central difference, 218, 331 
Central limit theorem, 65, 66, 83 
Central moments, 284 

Ch a in rule, 21, 33, 39, 43, 84, 153, 164, 169, 
178, 240 

Chain rule decomposition, 84 
Chain rule of probability, 38 
Channel impulse response, 389 
Chapman-Kolmogorov, 39, 41 
Chapman-Kolmogorov equation, 150, 244 
Characteristic polynomial, 109 
Chemical deflection, 400 
Chi-square (C-Sq) tests, 280, 281 
Chi-squared distributed, 185 
Cholesky decomposition, 265 
Cholesky factor, 219 
Classical (EKF), 327 
Classical approach, 4, 139, 270, 317 
Classical nonlinear processors, 220 
Classification, 352 
Classification theory, 8 
Coded signal, 358 
Coefficient of variation, 247 
Coin tossing, 425 
Coloring filter, 146 
Communications satellite, 17 
Complete data, 26,27 
Complete likelihood, 32, 367 
Complete log-likelihood, 27 
Completely controllable, 109, 110 
Completely observable, 107,110 
Complex sinusoids, 47 
Condensation, 245 
Conditional 
density, 24, 77 
distributions, 75, 219, 429 
expectations, 27, 30,45, 151,163, 169, 428 
Gaussian distributions, 163, 169 
independence, 85, 240, 244, 339, 340, 342, 
345, 349 
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likelihood distribution, 148 
mean, 3, 32, 34,37, 52,167,169,219, 
222, 239,271,411 
mean estimate, 271 
probability, 2, 32, 148 
transitions, 75 

Conditionally independent, 39, 148 
Conditionally unbiased, 34 
Confidence interval, 184 
Confidence limits, 166 

Constant velocity model, 225 

analog domain, 100 
approximation, 265 
distribution, 238, 264 
dynamics, 102 
observation, 337 
probability distribution, 264 
random variable, 92 
state transition matrix, 113 
discrete, 146 

time, 96,100,102, 104,112,139 
Gauss-Markov model, 112 
Gaussian stochastic processes, 112 
process, 100 
stochastic process, 114 
systems, 95, 100, 104 
Controllability, 108 
Control loop, 373 
Convergence, 54, 55, 249 
Converges in-distribution, 65 
Converges uniformly, 220, 222 
Convolution, 125 
Coordinate systems, 230 
Corrected covariance, 307 
Corrected state/parameter estimates, 306 
Corrected state equations, 306 
Corrected state estimate, 180 
Correction equation, 42 
Correlated Gauss-Markov, 121 
Counting functions, 354, 355 
Counts, 28 
Covariance, 125, 208 
estimates, 186 
function, 143 
matrix, 131 
Coverage, 270 
Covering, 62 

Cramer-Rao lower bound, 21, 22, 45, 49, 
395 

Cross-covariance, 185 
Cross error covariance, 215 


Cumulative, 53 

Cumulative distribution, 92, 278, 282 
function, 13, 57, 248 

Decoding, 347, 350, 351 
Decomposition, 77, 111, 307 
Deconvolution, 412, 415, 416 
Deconvolution problem, 227 
Degeneracy, 246, 247, 248, 264 
Delta family, 220 
Delta function, 66, 221, 223 
Demodulation, 333 
Density, 13, 53, 203 
Depletion, 246 
Descent algorithm, 310 
Detailed balance, 71,78 
Detection problem, 332 
Determinant, 99 
Deterministic, 107 
Deviation data, 378 
Diagnostic testing, 280, 289 
Difference equation, 122, 129, 143 
Differential equations, 96, 100, 102 
Dirac delta function, 83 
Directed graph, 16, 337, 343 
Direct method, 56 

Discrete cumulative distribution functions, 58 
Discrete 
domain, 100 

nonlinear process, 138, 146, 165, 170 

observation, 353 

posterior distribution, 248 

power spectrum, 118 

probability mass function, 265 

probability matrix, 336 

random variable, 90 

state transition matrix, 106 

sums, 66 

system, 104, 105, 107 
systems theory, 107, 140 
transfer function, 109 
time, 104, 112, 139 

hidden Markov model, 335 
Markov chain, 336 
representation, 95 
systems, 104 
variable, 12 
variate, 13 

Wiener-Hopf equation, 35 
Dispersion index distribution, 284 
Dispersion relation, 386 
Distance measure, 275 
Distance metric, 293 
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Distribution, 2, 57, 65, 71 
estimation, 8, 53 
function, 424 
validation, 273 
Diverge, 183 
Divergence, 289 
Divergence measure, 273 
Diversification, 264 
Diversity, 249, 264, 266, 318 
DNA strings, 351 
Draw samples, 82 
Dynamic, 2, 8 
physical systems, 339, 351 
random variables, 37 
state, 299 

state variables, 148 
systems, 96, 97, 104,107, 110 
variables, 39, 41, 43, 82, 84 
wavenumber, 322 

Effective number of particles, 247 
Efficient, 21 

Eigen-decomposition, 102 
Eigenvalues, 111 
Eigenvector matrix, 111 
Eigenvectors, 102 
Elliptical orbit, 49 
EM algorithm, 362, 367 
EM/Baum-Welch, 355, 356 
EM principle, 27 
Embedded dynamic model, 280 
Emission densities, 31 
Empirical approximation, 314 
Empirical distribution, 7, 60, 64, 66, 246, 261, 
263,265,280,315,317 
Empirical posterior distribution, 251, 264 
Empirical prediction cumulative distribution, 278 
Enhanced signal, 9 
Ensemble, 167, 169, 227, 426
Ensemble estimates, 181, 286 
Entire hidden state sequence, 347 
Entire state sequence, 347 
Entire state sequence (path) estimation, 350 
Entropy, 273, 427, 428 
Epanechnikov, 53, 265 
Ergodic, 71, 183, 184 
Ergodic process, 142 
Error covariance, 162, 223 
Error covariance matrix, 307 
Error covariances, 36 
Error variance, 24, 25, 33, 251 
Estimates, 2, 3 
Estimated distributions, 275 


Estimated instantaneous posterior distribution, 286 
Estimated posterior distribution, 273 
Estimated residual, 280 
Estimated state, 309, 312 
Estimation, 2, 8 

Estimation error, 20, 34, 162, 223 
Estimation problem, 22 
Estimation scheme, 270 
Estimators, 13, 36 
Estimator quality, 21 
Evaluation, 357 

Evaluation problem, 342, 343, 345 
Event mode sequence, 406 
Evidence, 2, 19, 37, 52, 64, 82, 85 
Expectation, 6, 64 
Expectation maximization, 25, 299 
Expectation maximization (EM) algorithm, 352
Expectation-step, 26, 28, 30, 367 
Expected value, 427 
Exponential 
class, 30 

distribution, 58, 59, 61, 93 
family, 30, 32 

Extended Bayesian processor, 147, 167, 303 
Extended Kalman filter, 147, 167, 191, 303, 398 
External, 109 

Factored power spectrum, 143 
Factorization techniques, 169 
Feature, 241 
Field, 423 

Filtered conditional, 150 

Filtered measurement, 168, 181, 310, 313 

Filtering, 242 

Filtering distribution, 36, 42, 150, 315 
Filtering posterior, 221, 223, 239, 261, 315 
Filtering posterior probability, 366 
Financial systems, 294 
Finite impulse response, 123 
First differences, 193, 321, 331 
First difference approximation, 104 
First order 
Markov, 38, 340 
Markov process, 40 
Markov property, 342 
Forensic analysis, 351 
Forward algorithm, 350 
Forward and backward, 355, 356 
Forward-backward 
algorithm, 347 
approach, 347 
recursion, 347 
Forward operator, 343, 353 
Forward recursion algorithm, 343, 345
Fourth-order moments, 207 
Frequency modulation, 333 
Frequency ratio, 54 
Fundamental theorem of calculus, 58 
Fusion experiments, 369 

Gain, 154, 155, 162, 168, 176, 180, 182, 306
Gain matrix, 182 

Gamma ray spectrometry, 405, 408 
Gauss-Hermite (G-H) grid-based integration, 280 
Gauss-Hermite numerical integration, 218 
Gauss-Hermite rule, 218 
Gauss-Markov, 51, 389 
Gauss-Markov equations, 115 
Gauss-Markov model, 112, 115, 117, 120, 121,
143, 146, 151, 152, 157, 163, 169, 173,
174, 193, 196, 225, 226, 296, 330-332,
374, 394, 413

Gauss-Markov perturbation model, 137, 160 
Gauss-Markov representation, 95, 114, 139, 145,
333, 374

Gauss-Markov state-space model, 117 
Gauss-Markov wavefront curvature model, 296 
Gauss-Newton, 310 

Gaussian, 2, 7, 11, 30, 53, 66, 68, 73, 121, 142,
150, 239, 265, 276
approximation, 233 
based technique, 277 
density, 220 
distributed, 13 

distributions, 51, 74, 79, 115, 280 
importance distribution, 270 
importance proposal, 270 
kernel estimator, 56 
mixtures, 73, 220-223, 276 
mixture approach, 280 
mixture distribution, 221, 276 
mixture framework, 220 
noise, 93, 95, 213
posterior, 51 
prior, 25, 222, 270, 271 
processes, 115, 283 
proposal, 270 
random variables, 24, 141 
random vectors, 429 
sequences, 273 

sums, 220, 223, 230, 235, 280 
vectors, 429 
window, 56 
window estimator, 80 

Gibbs sampler, 51, 71, 75, 77, 79, 87, 92, 266
Golden Rule of Sampling, 62 
Goodness of fit, 277, 281, 282 


Gradient, 179 
operation, 169 
operator, 153, 164, 179 
vector, 12, 20, 178, 198
Gray box, 122 
Green’s function, 357, 358 
Green’s function approach, 384 
Grid-based, 4, 218, 230

Hankel function, 385, 387 
Hankel matrix, 109, 111
Helmholtz equation, 386, 394 
Hessian, 178, 179
Heuristic, 182 
Hidden dynamic, 148 
Markov chain, 337, 341, 362 
Markov models, 335, 337, 338, 340, 345, 362 
Markov processors, 335 
states, 238, 347, 366 
state estimation, 345 
state estimation problem, 346 
variables, 26, 27, 30, 340, 345
Higher order moments, 208 
High probability, 246 
High probability regions, 239, 264, 266 
High process noise, 264 
Histogram, 8, 52-54, 56, 75 
HMM parameter estimation, 350, 355 
Homogeneous, 70 
Homogeneous-in-time, 336, 340 
Homogeneous chain, 336 
Hypothesis test, 185, 280, 282 

Implicit marginalization, 270 
Importance distribution, 81, 82, 84, 240-242,
244, 245, 261, 316, 317
Importance estimator, 83 
Importance sampler, 80 
Importance sampling, 51, 64, 81, 82, 237, 239, 
247, 266, 316 
algorithm, 246 
approach, 87 
distribution, 81 
estimator, 82 

Importance weights, 240, 244-246, 249, 253, 285 

Impulse function, 7, 13 

Impulse response, 358 

Impulse response matrices, 97, 109 

Impulse response model, 413 

Impulse sampler, 13 

Incomplete data, 28, 33 

Incomplete measurements, 31 

Independent-identically distributed, 52 

Independent samples, 52, 64 
Indicator function, 93
Individual state estimate, 347 
Inferences, 52 

Infinite impulse response, 123 
Infinite power series, 109 
Information, 427 
Information matrix, 21 
Information theoretic approaches, 273, 289 
Innovations, 11, 152, 161, 162, 166, 176, 182,
181, 185, 227, 273, 277, 403
covariance, 154, 155, 184, 305, 306, 331
model, 120, 121, 145, 330
representation, 131, 331
sequence, 155, 168, 174, 182-184, 309, 312,
324, 330, 374, 395
vector, 121, 170, 183
Input-output structure, 122 
Input excitation, 100 
Input transmission, 101 
Input transmission matrices, 101, 113 
Instantaneous approximation, 239 
Instantaneous posterior distribution, 322, 324 
Instantaneous posterior filtering distribution, 285 
Intensity parameter, 28 
Internal probabilistic representation, 351 
Internal structural model, 351 
Internal structure, 350-352 
Internal variables, 97, 107 
Invariance, 71 
Invariant, 23 

Invariant distribution, 4, 70-73, 75, 78, 266 
Inverse Laplace transform, 97 
Inverse transform approach, 92 
Inverse transformation Theorem, 59 
Inverse CDF method, 60, 62 
Inversion problem, 56, 58, 62 
Invertible transformation, 57 
Irregularly spaced, 103 

Iterated-extended Bayesian processor, 147, 176, 
191 

Iterated Kalman filter, 270 
Iterative, 64 
approach, 352, 355 
methods, 87 
sampling techniques, 80 
simulations, 79 
technique, 30 


Jacobian, 57, 161, 167, 176, 197, 208, 215, 304,
309, 330

Jacobian matrices, 137, 160, 161, 168, 169 
Jacobian process matrix, 305 
Jittering, 264, 265 


Joint
bivariate Gaussian, 78
distribution, 38, 318, 339, 348 
dynamic distribution, 2 
entropy, 428 

estimation, 227, 302, 314 
event, 353 
index, 284 

mass function, 426, 428 
PF, 334 

posterior, 314, 315 

posterior distribution, 37, 41, 75, 78, 299, 327 
posterior estimation problem, 39 
probability density, 22 
random particles, 268 
SPBP, 334 
Joint state/parameter 
estimation, 299, 300, 301, 318 
posterior, 300 
processing, 302, 313, 327 
Jointly distributed, 46 
Jointly estimate, 333 
Joseph form, 193 

Kalman filter, 11, 139, 162, 183, 218, 230, 239, 
270, 293, 295, 339, 374, 411, 414 
Kalman filtering theory, 37 
Kalman gain matrix, 121 
Kalman techniques, 280 
Kernel, 53, 264, 265
density, 56, 75, 264, 317 
estimation, 53, 56, 265 
method, 317 
smoothing, 8, 53 
technique, 317 

Kirchhoff current equations, 156
Kolmogorov-Smirnov, 280, 282
Kullback-Leibler, 273, 293 
Kullback-Leibler information, 274, 276 
Kullback divergence, 275, 277 
Kurtosis, 205 

Lack of diversity, 289 
Langmuir kinetics, 399 
Laplace transforms, 97, 98, 122 
Law of Large Numbers, 4, 64, 67, 70 
Least squares, 19, 43
Least-squares estimate, 18 
Least-squares estimation, 35 
Level of significance, 185, 282 
Likelihood, 2, 19, 39, 52, 93, 151, 245, 253, 254,
322, 340, 341, 356, 366, 404
cumulative distribution, 273 
distributions, 3, 37, 148, 149, 261, 411 
estimation techniques, 352
function, 22, 23, 197 
method, 22 
probability, 148, 241 
ratio, 332, 333 
Linear algebra, 98 

Linear Bayesian processors, 155, 273, 339 
Linear discrete Gauss-Markov model, 150 
Linear dynamic systems, 367 
Linear Gaussian, 36 
Linear time-invariant, 97
Linearization, 51, 135, 147, 160, 270 
approaches, 139, 270
based particle filter, 271 
error, 199, 200 
methods, 270, 271 
process, 167 

techniques, 95, 138, 139, 161
Linearize, 213 
Linearized, 194, 270, 271 
algorithm, 289 
Gauss-Markov models, 95 
measurement perturbation, 137 
model, 214 

particle filter, 271, 272 
process model, 137 
state-space model, 160 
Linear Kalman filter, 150 
Linear regression, 199 
Linear systems theory, 335 
Linear time-invariant system, 108 
Linear time-varying, 96, 112 
Linear time-varying state-space, 105 
Linear transformation, 115 
Local iteration, 176, 191 
Local linearization, 317 
Local linearization technique, 268 
Local linearized particle filters, 245 
Local maximum, 355 
Location, 66 

Log-likelihood, 23, 26-28, 30, 275, 355
equation, 24 
function, 355 
ratio, 415 

Logarithmic transformation, 355 

Logarithmic a posteriori probability, 163, 169 

Long-tailed distribution, 270 

Low dispersion index, 284 

Low probability, 246 

LTI model, 97, 99 

Lyapunov equation, 113, 116 

Manhattan Project, 4 
MAP estimate, 20, 23, 153, 391 


MAP state estimation, 347 

Marginal distributions, 39, 79, 84, 93 

Marginalization, 64, 346 

Marginalizing, 270, 342 

Marginal posterior distributions, 7, 26, 44 

Marked Poisson process, 31 

Markov, 39, 42, 115 

Markov chain, 4, 51, 64, 70-72, 75, 78, 318,
336, 337, 340, 366
dynamics, 70 
methods, 64 
model, 365 
Monte Carlo, 1, 4, 70 
simulation, 71 
theory, 87, 339 
transition 
kernel, 266 
probability, 72 
assumptions, 240 
independence, 346 
model, 255 
property, 336 
representations, 2 
state-space model, 321 
state vector, 148 
structure, 237 
Markov parameters, 109 
Markov property, 84, 345 
Markov sequence, 109 
Markov switching model, 366 
Mass function, 13, 53, 238 
Matched filter, 357, 358, 375 
Matrix 

decomposition methods, 102 
difference equation, 106 
differential equation, 99 
exponential, 98, 101, 102 
inversion lemma, 153, 170, 179 
square roots, 120, 208, 219, 265 
Maximum, 240 
Maximal path, 348 

Maximization step, 26, 27, 29, 30, 32, 367 
Maximum-likelihood estimate, 185 
Maximum a posteriori, 19, 20, 36, 43, 47,
411

Maximum a-posteriori estimate, 25, 48, 347 
Maximum deviation, 282 
Maximum likelihood, 19, 23, 36, 47 
estimates, 22-25, 31, 43, 48, 354, 356
parameter estimation, 25, 26, 30 
MC 

methods, 4, 7 
model diagnostics, 277 

sampling techniques, 276 
simulation, 238 

MCMC iterative processor, 266 
MCMC sampling, 266 
MCMC-step, 266, 268, 318 
MCMC technique, 266 
Mean, 208, 426 

Mean propagation recursion, 124 
Mean-squared error criterion, 35 
Measurement 
covariance, 117 
distribution, 284 
instrument, 17 
Jacobian, 176, 271, 305
likelihood, 245, 252 
linearization, 162 
mean vector, 114, 115 
models, 16, 270, 295 
noise, 152 

nonlinearities, 176, 180, 181 
perturbation, 137 
power spectrum, 117 
prediction, 213 
system, 48, 107 
system models, 8 
variance, 114, 116 
Measure of degeneracy, 247 
Median, 286 

Method of composition, 52 
Metropolis, 74, 266 
Metropolis algorithm, 71, 92 
Metropolis-Hastings, 71, 74, 75, 266 
approach, 51 

sampler, 71, 77, 87, 92, 93 
technique, 268 
Metropolis technique, 64, 79 
Microcantilever sensor, 9, 397 
Microelectromechanical sensor, 397 
Minimal realizations, 109 
Minimum data length (MDL) description, 275
Minimum error variance, 158, 182, 183 
estimator, 339 

Minimum mean-squared error, 19, 33, 240 
Minimum variance, 19, 33, 35, 289, 317 
approach, 253, 254 
design, 155, 182 
estimation, 18, 395 
estimator, 33-35, 43, 46
importance distribution, 242, 271 
optimal, 261 

importance function, 245, 270 
importance proposal, 270, 289 


proposal distribution, 243 
weights, 244 
Mismatch, 395 
Missing/hidden vectors, 26 
Missing data, 26, 27, 30, 32 
Mixing coefficients, 73, 220, 222, 276 
Mixture, 220, 277 
Modal functions, 389, 395 
Model based 
approach, 1, 3, 8 
likelihood, 150 
processing, 9
processor, 8, 17, 121, 277 
signal processing, 1, 7, 8, 11, 95 
solutions, 149 
Model 

mismatches, 186 
parameters, 354, 355 
uncertainties, 8 
validation, 289 
Modern technique, 317
Moments, 13, 65, 203, 205, 276 
Monoenergetic decomposition, 406 
Monte Carlo, 1, 4, 51, 52, 65, 169
approach, 4, 64, 68, 299 
error, 242 

estimates, 7, 66, 238 
integration, 7 
methods, 4, 272 
simulation techniques, 52 
techniques, 52, 64, 65, 70 
Most probable paths, 355 
Move, 266, 334 
Move step, 266, 268
Moving average, 123, 131, 132 
Multichannel, 95, 117 
Bayesian solution, 384 
data, 401 
processor, 397 
Multimodal, 17, 53, 318 
distribution, 237 

Multinomial distribution, 250, 280 
resampling, 250 
sampling method, 252 
Multipath, 194 
Multivariable 
representation, 97 
structures, 121 
transfer function, 98 
Multivariate, 208 

Gaussian distributions, 11, 151, 152, 210, 230, 
293 

Mutual information, 274, 428 
National Ignition Facility, 370
Navigation, 224 
Nearest neighbor, 55 
Nelder-Mead polytope, 401
Neural networks, 195, 234, 352 
Newton-Raphson, 178, 191
Neyman-Pearson criterion, 415 
Non-Gaussian distribution, 9, 52, 220 
Nonlinear, 95 
Bayesian processors, 197 
Bayesian signal processing, 191 
cost function, 177 

discrete-time state-space representation, 105 

dynamic model, 305 

dynamics, 270, 302 

dynamic systems, 96, 277 

estimation, 7, 139, 220 

filtering, 167, 181, 209 

measurement, 163 

measurement model, 137 

measurement system, 147 

models, 209, 215, 304

non-Gaussian model, 289 

non-Gaussian signal processing, 52 

parameter estimator, 303 

problems, 238 

process, 209, 210 

processing, 60 

processors, 217, 223, 230, 234, 397 
re-entry problem, 300 
sensor model, 174, 297 
signal processing, 36, 87 
state estimation, 191, 198 
state-space representation, 285, 302, 334 
stochastic vector difference equations, 135, 
160 

systems, 101, 135, 139, 146, 160, 194, 233,
307, 350

trajectory estimation, 217, 327 
transformations, 198, 199, 201, 205, 210 
vector functions, 136, 160 
Nonparametric methods, 8 
Nonphysical systems, 351 
Nonrandom constant, 24 
Nonstationary, 52, 95, 117 
Normal form, 134 
Normality diagnostics, 283 
Normal mode theory, 384 
Normality testing, 280, 284
Normalization, 52, 64 
condition, 207 
constant, 62 
constraint, 201 


covariance, 184 
Gaussian random vector, 430 
innovations variance, 185 
weights, 246, 249 
Normalizing 

constant, 64, 70, 71, 82, 83 
distribution, 85 

Normal state-space form, 135 
Nuclear physics, 64 
Null hypothesis, 183 
Numerical 
implementation, 307 
integration, 3-5, 7, 37, 52, 65, 101, 103 
quadrature, 4 
Numerically stable, 307 
Nyquist sampling theorem, 100 

Observability matrix, 108 
Observable, 107 
Observation 

probability, 337, 338, 339, 341-343, 352
probability matrices, 347 
process, 339 
sequence, 339 

Observer canonical form, 129, 130 

Ocean
acoustic, 324
environment, 382 
On-line, 302, 352 

One-step prediction distribution, 223 
Optimal, 36, 183
bandwidth, 56 
Bayesian algorithms, 36 
Bayesian estimate, 238 
Bayesian processor, 152 
matched filter, 358 
minimum variance solution, 51 
path, 350 

processor, 150, 295 
Optimality, 273 
Optimality tests, 324 
Optimization, 4, 5, 8, 64, 177, 299 
Ordered moments, 206 
Ordered uniform variates, 251 
Ordinary differential equation, 8, 101, 102 
Orthogonal, 34, 142 
Orthogonality condition, 34, 35 
Orthogonal set, 102 
Outlier performance, 261 

Padé approximation, 101

Parameter estimates, 2, 20, 26, 309, 312, 318, 356 
Parameter estimation, 33, 275, 299, 300,
314, 331, 350, 352, 356
problem, 351, 352, 362, 367 
techniques, 351 

estimators, 302, 310 
posterior distribution, 315 
space, 73 
variables, 299 
vector, 318 

Parametrically adaptive, 300, 302, 310, 
332 

Bayesian signal processor, 302 
model-based processor, 307 
Parametric 
models, 351 
posterior, 300 
signal processors, 273 
Partial differential equation, 8 
Partial fraction expansion, 134 
Particle 
based, 327 
approximation, 279 
based processors, 237 
degeneracy problem, 252 
depletion, 289 
problem, 270 
diversity, 266, 268 
filter, 266, 294, 299, 411
design, 284, 289, 411
filtering, 237, 238, 242, 314, 324 
algorithm, 252 
filters, 237, 239, 327, 415 
paths, 261 
set, 266 
weights, 246 

Particles, 238, 239, 252, 266, 316-318
Partition the data, 352 
Partitions, 305 
Parzen window, 53, 55 
Passive localization, 172 
Path, 352, 353 
estimate, 350 
known, 354 
Penalty function, 203 
Perfect sampling, 7, 68 
Performance, 283 
statistics, 36 
tests, 273 

Perturbation model, 195 
trajectory, 136 
Phased array radar, 224 
Phase modulation, 333 


Photon 
counter, 31 

emission computed tomography, 31 
emitted, 31 

Physical phenomenology, 8, 122, 397 
Physical systems, 100, 104, 314, 352 
Physics-based approach, 409, 417 
parameter estimation, 400 
processor, 397, 404 
signal processing, 398, 410 
Piecewise constant, 304 
Plane wave model, 296 
Plutonium nitrate, 295 
Point estimate, 3, 264 
Point mass, 238, 239 
Poisson, 31 
counts, 29 
distribution, 28, 31 
driven Markov process, 404 
noise, 28 
processes, 28, 30
rate, 31 

Polar coordinates, 171 
Pole-zero, 145 
Poles, 99 

Population growth, 284, 289 
problem, 285 
model, 334 

Population or system of particles, 241 
Position i it model, 372 

Positive delta family, 220 
Possible paths, 355 

Posterior, 2, 19, 20, 36, 40, 72, 86, 209, 240, 
279, 316 

distribution, 2-4, 36, 37, 41, 44, 51-53,
81, 82, 84, 93, 215, 245, 252, 256,
264, 272, 273, 300, 314, 317, 318,
346, 352, 353
equation, 150 
filtering distribution, 321 
invariant distribution, 266 
mean, 208 

Posterior probability, 27, 152, 274, 346, 347, 
352, 355, 360, 365 
Posterior tail performance, 261 
Posterior target distribution, 82 
Power spectrum, 142, 196, 235 
Practical application, 241 
Practitioners, 355 
Predicted 

cumulative distribution, 277 
error covariance, 304 
estimate, 168, 182
measurement, 214, 304

cumulative distribution, 277 
perturbation, 167 
state distribution, 278 
state error covariance, 212 
state estimation error, 212 
Prediction, 273 

cumulative distribution, 280 
distribution, 3, 37, 42, 151, 261, 279 
error, 182 
estimates, 402 
probability, 366 
recursion, 41, 150 
step, 213, 221, 223
Predictive 

decomposition, 278 
distribution, 221 
measurement cumulative 
distribution, 278 
Predictor corrector form, 307 
Pressure field, 386 
Prior, 2, 3, 19, 37, 40, 52, 81 
distribution, 2 
prediction, 245 
Probabilistic 
axioms, 427 
chain rule, 428 
framework, 36 
information, 246 
model, 335, 339 
propagation model, 148 
transition distribution, 148 
Probability 
bound, 281 

density, 22, 53, 55, 57, 220, 265 
distribution, 3, 25, 36, 64, 66, 70, 198, 
220, 274, 351, 424
distribution estimation, 276 
distributions, 1, 4, 8, 51, 237, 239, 335 
from data samples, 53 
function, 423 

mass, 58, 238, 246, 248, 261
mass distribution, 7 
mass function, 90, 424, 427 
matrices, 359 
Probability of error, 347 
Probability of success, 250 
Probability theory, 426 
Process, 105 
dynamics, 9 
model, 8, 151
noise, 245, 254, 322 
noise covariance, 157, 295 


Processor statistics, 186 
Proportional to, 83 

Proposal distributions, 63, 64, 71, 75, 81, 82,
243, 253

Pulse transfer function, 122, 123, 145 
Quadrature 

Bayesian processor, 218 
Kalman filter, 218 
points, 219 

Quantile estimate, 284 
Radar, 331 

Radiation detection problem, 404, 408, 411 
Radiation transport, 409 
Radionuclide, 31 
source detection, 404 
Random, 41 
amplitude, 48 
draw, 54 
inputs, 114 

measures, 238, 246, 250 
parameter, 2, 19 
samples, 5, 7, 52, 56, 57 
sampling, 4, 52, 57, 251 
signal, 2, 6, 36, 65, 114, 123, 234
signal processing, 3 
target motion, 321 
telegraph signal, 351 
variables, 53, 57, 142, 423
vectors, 27, 46, 198, 201
walk, 73, 256, 268, 304, 314, 317, 321
walk, Metropolis-Hastings, 74 
walk model, 268 
walk parametric model, 327 
Range, 227 
Rank, 108, 111 
Rank condition, 109 
Rate parameter, 28, 31
Rayleigh distributed, 48 
Realizations, 60, 423 
Realization problem, 109 
Recursive, 11 
approach, 11 
Bayesian estimation, 36 
estimation, 11 
form, 11, 12 
processor, 197, 209 
Reference 

measurement, 137, 161 
position estimate, 373 
state, 167 

trajectory, 135-137, 160, 161, 165, 167, 176
Region of strongest support, 268
Regions of high probability, 318 
Regression 

coefficients, 294 
form, 213 
weights, 200 

Regularization kernel, 264 
Regularization property, 264 
Regularized, 289, 317 
Rejection 

method, 62, 64, 71, 92, 247 
sampling, 62, 70, 79, 87 
method, 62, 92 
Relative frequency, 54 
Relative performance, 273 
Repeated squaring, 101 
Replicating samples, 248 
Replication, 268 
Resample, 322 
Resampled particles, 266 
Resampled uniformly, 252 
Resampling, 246, 248, 254, 261, 264,
266, 268, 270
algorithm, 248 
method, 249 
operator, 246 
problem, 237 
process, 266 
scheme, 250, 252
step, 248, 266 
technique, 251 
theory, 247 

Residuals, 214, 278, 280, 281, 283 
method, 251 
prediction, 215 
resampling, 251, 252 
sequence, 273, 278, 281, 282 
Resolvent, 99 
matrix, 98 
Response time, 99 
Reverberant channel, 357 
RLC circuit, 186, 188 
Rosenblatt’s theorem, 278 
Roughening, 256, 257, 258, 295, 317, 318 
Rule of thumb, 108 
Runge-Kutta, 101 

S-plane, 99 

Sample-based simulation methods, 52, 59 
Sampled-data, 95, 99, 139
model, 100 
process, 114 
process noise, 113 


state-space system, 101, 114 
system, 19, 100, 102, 104, 113
Sample 

impoverishment problem, 256 
mean, 11, 67, 183, 284
space, 423 
variance, 160, 184 
variance estimators, 186 
Sampling, 77 

algorithms, 62, 75 
approach, 4 
distribution, 83 

importance-resampling, 64, 237, 247, 253, 261 

interval, 103 

methods, 51 

problem, 62 

resampling, 197 

scheme, 250, 251 

techniques, 4, 66, 75 

theory, 80, 87 

Satellite communications, 224 
Scaled kernel, 265 
Scaling and squaring, 101 
Sequence estimation, 350 
Sequential, 11, 64 
Sequential approaches, 80 
Sequential Bayesian, 39 
estimation, 36 
estimators, 238 
framework, 321 
posterior estimator, 40 
processor, 41, 44, 149, 150 
recursions, 268, 336 
Sequential 

bootstrap processor, 404 
estimation, 84, 237, 239, 248, 273 
estimation framework, 36 
importance sampler, 87 
importance sampling, 86, 240, 241 
methods, 36 
Monte Carlo, 299 
approach, 313 
method, 237 
processing, 11, 169 
processors, 2 

simulation-based techniques, 246 
solution, 348 
updating, 253 
Series approach, 101 
Sifting property, 7, 83, 223 
Sigma-point (unscented) transformation, 201 
Sigma-point Bayesian processor, 197, 230,
235
Sigma-point
design, 256 
processor, 256 
transformation, 51, 200 

Sigma points, 200, 202, 207, 208, 218, 222, 233, 
271 

Signal-to-noise ratio, 8 
Signal enhancement, 36, 41, 150 
Signal estimate, 2 

Signal processing, 6-8, 52, 65, 79, 97, 182, 351, 
357, 406 

Signal processing model, 408, 409 
Significance level, 184, 185, 280
Similarity transformation matrix, 102 
Similarity transformations, 102 
Simulated trajectories, 248 
Simulation based, 7 
approach, 4, 64 
Bayesian processors, 51 
methods, 53, 56 
sampling, 87 
technique, 246 
solution, 67 

Single input/single output, 121 
Singular values, 110 
Singular vectors, 110, 111 
Skewness index, 284 
Slice sampler, 78, 87, 92, 93 
Smart sensor, 398 
Smoothing, 345 
parameter, 56 
relation, 350 
variable, 349 
Sonar, 172 

Space-time processing problem, 318 
Spatio-temporal channel, 357 
matched filter, 357 
Spectral factorization, 121 
Square-root matrices, 111 
SSPF algorithm, 241 
Stability, 99 
Stabilized, 193 
Stable realization, 111 
Standard Gauss-Markov model, 120 
Standard uniform, 278 
State, 95, 96
State information, 210 
State-input transfer matrix, 98 
State-space, 2, 96, 157, 191, 220, 295
form, 135, 145, 160, 295 
forward propagator, 384 
models, 95, 96, 104, 122, 139, 147, 148, 182,
321


particle algorithm, 239 
particle filters, 237, 241, 285, 289 
representations, 95, 96, 105, 112, 121, 122,
149, 237-239, 339
structures, 121 
transition, 252 
State 

covariance, 116 
delay, 194 
equations, 98, 100 
error covariance, 151 
error covariance update, 215 
error prediction, 213 
estimate, 167, 346 
(sequence) estimation, 349 
estimation, 299, 300, 316, 345, 346 
errors, 151, 152, 186 
problem, 150, 315, 334, 345, 347, 362 
mean vector, 115 

parameter estimation, 26, 301, 303 
perturbation, 162 
posterior distribution, 300, 365 
prediction, 210 
prediction probability, 365 
sequence, 352, 354, 358, 361 
estimation, 350 
transition, 101, 314, 344, 354 
transition distribution, 322 
transition matrix, 99-101, 106, 193, 335, 
340, 367 

transition mechanism, 46 
transition model, 315 
transition probability, 149, 285, 338 
transition probability matrix, 336, 343 
variables, 96, 209 
variance, 114, 116 
vector, 96, 98
Static parameter, 318 
Stationary, 70, 117 
chain, 343 
distribution, 64 
processes, 118 
Statistical 

approximation, 198 
estimation, 67 
hypothesis test, 183 
indexes, 283 

inferences, 2, 3, 37, 52, 246 
linearization, 270 
white sequence, 160 
measure, 65 
mechanics, 64 
sampling techniques, 57 
signal processing, 5, 36, 43, 147 
simulation-based techniques, 52 
tests, 182, 237, 273, 277, 278, 284, 289 
Statistics, 240 
Steady state, 117, 186, 330 
Stochastic 

deconvolution problem, 413 
linearization, 230 
models, 382 

processes, 169, 335, 425, 427
realization, 339 
sampling, 65 
system, 4 
Stopping rule, 177 
Strong Law of Large Numbers, 65 
Structural model, 330 
Student T distribution, 74 
Suboptimal, 310 
Sufficient statistic, 23, 32 
Sum-squared error criterion, 35 
Superposition integral, 100 
Survival of the fittest algorithm, 253 
Switching model, 366 
Symmetric distribution, 72 
Synthetic aperture, 296, 318, 319, 322, 324 
towed array, 327 
Systematic resampling, 248, 251 
System identification, 275, 350, 415 
System model, 105 
Systems theory, 96, 98, 107, 112, 350 

Tail Index, 284 
Tank, 295 
Target, 195, 234 

distribution, 62, 70, 71, 73, 77, 78, 79, 81,
82, 92

posterior distribution, 52, 71, 239, 246 
tracking, 324 

Targeted posterior distribution, 53, 246 
Taylor-series approach, 104 
Taylor series, 51, 95, 101, 103, 104, 114, 137,
138, 169, 177, 188, 193, 197, 209, 332, 414
Temporal incoherent processor, 387 
Test statistic, 183-185, 280-282 
Thermal deflection, 400 
Time reversed Green’s function, 359 
Time varying, 101, 150, 167, 332 
Time-varying volatility, 294 
Time delay, 196, 234, 297 
Time domain representation, 122 
Time invariant systems, 106 
Time reversal, 357, 358 
Time reversal processing, 358, 362 


Time reversible, 70 

Total observation probability, 342, 343, 345, 346 
Total observation sequence, 345 
Towed array, 324 

Tracking problems, 171, 172, 223, 224, 230, 253, 
339 

Tracking telescope, 18 
Training 
data, 354 

sequences, 196, 234, 354, 355, 357 
sets, 352 

Trajectory estimation, 318 

Transfer function, 97, 106, 109, 122, 134, 415 

Transfer function matrix, 98 

Transformation, 57 

Transformed residual, 280, 281, 284 

Transformed statistics, 208 

Transient problems, 412 

Transition 

distribution, 75, 78 
kernel, 70, 71, 266 
matrix, 98, 335, 366 
posterior, 317 

prior, 245, 253, 270, 278, 279, 289 
probabilities, 148, 336, 343, 352 
probability, 4, 70, 71, 78, 148, 264, 321, 336, 
344, 404 
matrix, 365 
Transitions, 354 
True posterior, 265, 273 
Truncated Gaussian, 63 
Truncation error, 104 
Tuned processor, 183, 186, 256 
Tuning, 258 

UD-factorized form, 307 
Unbiased, 65, 66 
estimate, 45 
estimator, 81 

Unconditionally unbiased, 34 
Uncorrelated, 141 
Uncorrelated noise, 120 
Unequally sampled data, 101 
Uniform convergence, 222 
Uniform distribution, 56, 58, 273 
Uniform intervals, 79 
Uniformity property, 281 
Uniformly distributed, 56, 79, 281-283 
Uniformly distributed random variable, 58 
Uniformly sampling, 248 
Uniform 
proposal, 93 
random samples, 5 
random variable, 56
samples, 250 
sampling, 87 
distribution, 67 
procedure, 248 
simulation, 62 

transformation theorem, 59, 277 
variates, 56, 57, 61
weighting, 248, 252 
Unimodal, 256, 277 
Unimodal distribution, 237 
Unit-step function, 60 
Unnormalized weight, 84 
Unobservable, 107 
Unscented, 270 

Unscented Kalman filter, 197, 230, 324 
Unscented transformation, 200, 270 
Update, 42 

Update equation, 162, 182 
Update step, 219 
Updated, 355 
error covariance, 154 
estimate, 26, 167, 176


Validation problem, 273 
Validity, 273, 278 
Variance, 125, 426
Variance equations, 118 
Vector calculus, 21 
Viterbi, 355, 362 
algorithm, 349, 350, 357 
approach, 350, 360 
training, 355 
Volatility, 294 


Wavefront curvature, 296 
Weighted function, 83 
Weighted particles, 238 
Weighted quadrature points, 220 
Weighted sum-squared residual, 185, 374 
Weighting function, 40, 53, 82-84 
Weighting matrix, 182 
Weight recursion, 244, 316 
Weights, 238, 245 
Weight variances, 242, 289 
White, 115, 121, 146, 160, 174, 181, 227, 
273, 403 
Whiteness, 181 

Whiteness tests, 160, 184, 185, 273, 280, 284 
Whitening transformation, 265 
White noise, 95, 143
Wiener, 35 

Wiener-Kalman filtering, 121 
Wiener solution, 35 
Wold decomposition, 122 

Zero mean, 142, 146, 160, 196, 227, 234, 
273, 403 

Zero mean test, 181, 184 
Zero-mean-whiteness, 378 
Zero-mean/whiteness detector, 374 
Zero-mean/whiteness tests, 277, 280, 310, 
312 

Z-transforms, 106, 109, 117, 118 





